Biblioteca Digital

This paper proposes a novel Hybrid Clustering approach for XML documents (HCX) that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The empirical analysis reveals that the proposed method is scalable and accurate.

Veja mais

XCFS - a novel approach for clustering XML documents using both the structure and the content

Relevância:

20.00% 20.00%

Publicador:

Resumo:

XML document clustering is essential for many document handling applications such as information storage, retrieval, integration and transformation. An XML clustering algorithm should process both the structural and the content information of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. This paper introduces a novel approach that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The proposed method reduces the high dimensionality of input data by using only the structure-constrained content. The empirical analysis reveals that the proposed method can effectively cluster even very large XML datasets and outperform other existing methods.

Veja mais

Effective pattern taxonomy mining in text documents

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Many data mining techniques have been proposed for mining useful patterns in databases. However, how to effectively utilize discovered patterns is still an open research issue, especially in the domain of text mining. Most existing methods adopt term-based approaches. However, they all suffer from the problems of polysemy and synonymy. This paper presents an innovative technique, pattern taxonomy mining, to improve the effectiveness of using discovered patterns for finding useful information. Substantial experiments on RCV1 demonstrate that the proposed solution achieves encouraging performance.

Veja mais

On the relevance of documents for semantic representation

Relevância:

20.00% 20.00%

Publicador:

Veja mais

Overview of the INEX 2009 XML mining track : clustering and classification of XML documents

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2009 XML Mining track. The report also describes the approaches and results obtained by the different participants.

Veja mais

Development of audiovisual industries in the Northern Rivers Region of NSW

Relevância:

20.00% 20.00%

Publicador:

Veja mais

Catch me if you can : the effective service of bankruptcy documents in a changing world

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This article discusses some recent judicial decisions to assist legal practitioners to overcome some of the problems encountered when serving Bankruptcy Notices and Creditor’s Petitions. Some of the issues covered in the discussion are: What the valid last-known address of the debtor can be, whether a Bankruptcy Notice can be validly served by email on a debtor who is located outside Australia, whether service of a Bankruptcy Notice is valid when the debtor is outside Australia when service on the debtor occurs in Australia, whether the creditor’s failure to obtain leave for service of a Bankruptcy Notice can be excused, what can be done regarding personal service of a Creditor’s Petition when a debtor is outside Australia and whether the Court can set aside a sequestration order. The article goes on to place the issues in the context of broader bankruptcy policies noting that effective service of bankruptcy documents is challenging in a world where mobility of debtors is global and new modes of communication ever changing.

Veja mais

Multi-version documents : A digitisation solution for textual cultural artefacts

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Textual cultural heritage artefacts present two serious problems for the encoder: how to record different or revised versions of the same work, and how to encode conflicting perspectives of the text using markup. Both are forms of textual variation, and can be accurately recorded using a multi-version document, based on a minimally redundant directed graph that cleanly separates variation from content.

Veja mais

XML documents clustering using tensor space model - A preliminary study

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A hierarchical structure is used to represent the content of the semi-structured documents such as XML and XHTML. The traditional Vector Space Model (VSM) is not sufficient to represent both the structure and the content of such web documents. Hence in this paper, we introduce a novel method of representing the XML documents in Tensor Space Model (TSM) and then utilize it for clustering. Empirical analysis shows that the proposed method is scalable for a real-life dataset as well as the factorized matrices produced from the proposed method helps to improve the quality of clusters due to the enriched document representation with both the structure and the content information.

Veja mais

Overview of the INEX 2010 XML mining track : clustering and classification of XML documents

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The XML Document Mining track was launched for exploring two main ideas: (1) identifying key problems and new challenges of the emerging field of mining semi-structured documents, and (2) studying and assessing the potential of Machine Learning (ML) techniques for dealing with generic ML tasks in the structured domain, i.e., classification and clustering of semi-structured documents. This track has run for six editions during INEX 2005, 2006, 2007, 2008, 2009 and 2010. The first five editions have been summarized in previous editions and we focus here on the 2010 edition. INEX 2010 included two tasks in the XML Mining track: (1) unsupervised clustering task and (2) semi-supervised classification task where documents are organized in a graph. The clustering task requires the participants to group the documents into clusters without any knowledge of category labels using an unsupervised learning algorithm. On the other hand, the classification task requires the participants to label the documents in the dataset into known categories using a supervised learning algorithm and a training set. This report gives the details of clustering and classification tasks.

Veja mais

XML documents clustering using Tensor Space Model

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The traditional Vector Space Model (VSM) is not able to represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilizing it for clustering. Empirical analysis shows that the proposed method is scalable for large-sized datasets; as well, the factorized matrices produced from the proposed method help to improve the quality of clusters through the enriched document representation of both structure and content information.

Veja mais

Selected new training documents to update user profile

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Relevance Feedback (RF) has been proven very effective for improving retrieval accuracy. Adaptive information filtering (AIF) technology has benefited from the improvements achieved in all the tasks involved over the last decades. A difficult problem in AIF has been how to update the system with new feedback efficiently and effectively. In current feedback methods, the updating processes focus on updating system parameters. In this paper, we developed a new approach, the Adaptive Relevance Features Discovery (ARFD). It automatically updates the system's knowledge based on a sliding window over positive and negative feedback to solve a nonmonotonic problem efficiently. Some of the new training documents will be selected using the knowledge that the system currently obtained. Then, specific features will be extracted from selected training documents. Different methods have been used to merge and revise the weights of features in a vector space. The new model is designed for Relevance Features Discovery (RFD), a pattern mining based approach, which uses negative relevance feedback to improve the quality of extracted features from positive feedback. Learning algorithms are also proposed to implement this approach on Reuters Corpus Volume 1 and TREC topics. Experiments show that the proposed approach can work efficiently and achieves the encouragement performance.

Veja mais

53 resultados para Audiovisual documents

Filtro por publicador