987 results for Documents électroniques
Abstract:
XML document clustering is essential for many document handling applications such as information storage, retrieval, integration and transformation. An XML clustering algorithm should process both the structural and the content information of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. This paper introduces a novel approach that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The proposed method reduces the high dimensionality of input data by using only the structure-constrained content. The empirical analysis reveals that the proposed method can effectively cluster even very large XML datasets and outperform other existing methods.
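The idea of structure-constrained content can be illustrated with a minimal sketch. It is not the paper's algorithm: frequent root-to-leaf paths stand in for frequent subtrees, and the function names, the `min_support` threshold and the toy documents are all assumptions made for illustration.

```python
from collections import Counter
import math

def frequent_paths(docs, min_support):
    """Return the paths that occur in at least min_support documents.

    Each document is a dict mapping a structural path to its text.
    Paths are a simplified stand-in for frequent subtrees (assumption).
    """
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))  # count each path once per document
    return {path for path, n in counts.items() if n >= min_support}

def constrained_terms(doc, paths):
    """Bag of words built only from text found under frequent paths."""
    bag = Counter()
    for path, text in doc.items():
        if path in paths:
            bag.update(text.lower().split())
    return bag

def cosine(a, b):
    """Cosine similarity between two term-count bags."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy collection.
docs = [
    {"article/title": "xml clustering", "article/body": "frequent subtrees",
     "article/note": "draft"},
    {"article/title": "xml mining", "article/body": "frequent patterns"},
    {"book/title": "bankruptcy law"},
]
paths = frequent_paths(docs, min_support=2)
bags = [constrained_terms(d, paths) for d in docs]
similarity = cosine(bags[0], bags[1])
```

Only text under the two shared `article/...` paths contributes terms, so the term space is smaller than the full vocabulary, which is the dimensionality reduction the abstract describes.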
Abstract:
Many data mining techniques have been proposed for mining useful patterns in databases. However, how to effectively utilize discovered patterns is still an open research issue, especially in the domain of text mining. Most existing methods adopt term-based approaches. However, they all suffer from the problems of polysemy and synonymy. This paper presents an innovative technique, pattern taxonomy mining, to improve the effectiveness of using discovered patterns for finding useful information. Substantial experiments on RCV1 demonstrate that the proposed solution achieves encouraging performance.
Abstract:
This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2009 XML Mining track. The report also describes the approaches and results obtained by the different participants.
Abstract:
This article discusses some recent judicial decisions to assist legal practitioners to overcome some of the problems encountered when serving Bankruptcy Notices and Creditor’s Petitions. Some of the issues covered in the discussion are: What the valid last-known address of the debtor can be, whether a Bankruptcy Notice can be validly served by email on a debtor who is located outside Australia, whether service of a Bankruptcy Notice is valid when the debtor is outside Australia when service on the debtor occurs in Australia, whether the creditor’s failure to obtain leave for service of a Bankruptcy Notice can be excused, what can be done regarding personal service of a Creditor’s Petition when a debtor is outside Australia and whether the Court can set aside a sequestration order. The article goes on to place the issues in the context of broader bankruptcy policies noting that effective service of bankruptcy documents is challenging in a world where mobility of debtors is global and new modes of communication ever changing.
Abstract:
Textual cultural heritage artefacts present two serious problems for the encoder: how to record different or revised versions of the same work, and how to encode conflicting perspectives of the text using markup. Both are forms of textual variation, and can be accurately recorded using a multi-version document, based on a minimally redundant directed graph that cleanly separates variation from content.
Abstract:
A hierarchical structure is used to represent the content of semi-structured documents such as XML and XHTML. The traditional Vector Space Model (VSM) is not sufficient to represent both the structure and the content of such web documents. Hence, in this paper, we introduce a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilize it for clustering. Empirical analysis shows that the proposed method is scalable for a real-life dataset, and that the factorized matrices it produces help to improve the quality of clusters, owing to the enriched document representation that carries both the structure and the content information.
Abstract:
The XML Document Mining track was launched for exploring two main ideas: (1) identifying key problems and new challenges of the emerging field of mining semi-structured documents, and (2) studying and assessing the potential of Machine Learning (ML) techniques for dealing with generic ML tasks in the structured domain, i.e., classification and clustering of semi-structured documents. This track has run for six editions during INEX 2005, 2006, 2007, 2008, 2009 and 2010. The first five editions have been summarized in previous editions and we focus here on the 2010 edition. INEX 2010 included two tasks in the XML Mining track: (1) unsupervised clustering task and (2) semi-supervised classification task where documents are organized in a graph. The clustering task requires the participants to group the documents into clusters without any knowledge of category labels using an unsupervised learning algorithm. On the other hand, the classification task requires the participants to label the documents in the dataset into known categories using a supervised learning algorithm and a training set. This report gives the details of clustering and classification tasks.
Abstract:
The traditional Vector Space Model (VSM) is not able to represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilizing it for clustering. Empirical analysis shows that the proposed method is scalable for large-sized datasets; moreover, the factorized matrices produced by the proposed method help to improve the quality of clusters through the enriched document representation of both structure and content information.
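A small sketch shows why a flat VSM can lose information that a tensor keeps. It assumes a third-order layout of document × structural path × term; the abstract does not fix the modes or the factorization, so the layout, the function names and the toy data here are illustrative only.

```python
def build_tensor(docs, paths, terms):
    """Build a document x path x term count tensor as nested lists.

    Each document is a dict mapping a structural path to its text.
    The mode layout is an assumption, not the paper's exact model.
    """
    p_idx = {p: i for i, p in enumerate(paths)}
    t_idx = {t: i for i, t in enumerate(terms)}
    tensor = [[[0] * len(terms) for _ in paths] for _ in docs]
    for d, doc in enumerate(docs):
        for path, text in doc.items():
            for token in text.split():
                tensor[d][p_idx[path]][t_idx[token]] += 1
    return tensor

def flatten_to_vsm(tensor):
    """Sum over the path mode to recover a plain document-term matrix."""
    return [[sum(column) for column in zip(*doc_slice)]
            for doc_slice in tensor]

# Two hypothetical documents with the same words in different elements.
docs = [
    {"title": "xml", "body": "tensor"},
    {"title": "tensor", "body": "xml"},
]
tensor = build_tensor(docs, paths=["title", "body"], terms=["xml", "tensor"])
vsm = flatten_to_vsm(tensor)
```

Both documents collapse to the same row in `vsm`, while their slices in `tensor` differ, so the tensor preserves which element each term occurred in.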
Abstract:
Relevance Feedback (RF) has been proven very effective for improving retrieval accuracy. Adaptive information filtering (AIF) technology has benefited from the improvements achieved in all the tasks involved over the last decades. A difficult problem in AIF has been how to update the system with new feedback efficiently and effectively. In current feedback methods, the updating processes focus on updating system parameters. In this paper, we develop a new approach, the Adaptive Relevance Features Discovery (ARFD). It automatically updates the system's knowledge based on a sliding window over positive and negative feedback to solve a nonmonotonic problem efficiently. Some of the new training documents are selected using the knowledge that the system has currently obtained. Then, specific features are extracted from the selected training documents. Different methods have been used to merge and revise the weights of features in a vector space. The new model is designed for Relevance Features Discovery (RFD), a pattern mining based approach, which uses negative relevance feedback to improve the quality of extracted features from positive feedback. Learning algorithms are also proposed to implement this approach on Reuters Corpus Volume 1 and TREC topics. Experiments show that the proposed approach works efficiently and achieves encouraging performance.
Abstract:
Most Australian states have introduced legislation to provide for enduring documents for financial, personal and health care decision making in the event of incapacity. Since the introduction of Enduring Powers of Attorney (EPAs) and Advance Health Directives (AHDs) in Queensland in 1998, concerns have continued to be raised by service providers, professionals and individuals about the uptake, understanding and appropriate use of these documents. In response to these concerns, the Department of Justice and Attorney-General (DJAG) convened a Practical Guardianship Initiatives Working Party. This group identified the limited evidence base available to address these concerns. In 2009, a multidisciplinary research team from the University of Queensland and the Queensland University of Technology was awarded $90,000 from the Legal Practitioners Interest on Trust Account Fund to undertake a review of the current EPA and AHD forms. The goal of the research was to gather data on the content and useability of the forms from the perspectives of a range of stakeholders, particularly those completing the EPA and AHD, witnesses of these documents, attorneys appointed under an EPA, and health professionals involved in the completion of an AHD or dealing with it in a clinical context. The researchers also sought to gather information from the perspective of Aboriginal and Torres Strait Islander (ATSI) individuals as well as people from culturally and linguistically diverse (CALD) groups. Although the focus of the research was on the forms and the extent to which the current design, content and format represent a barrier to uptake, in the course of the research, some broader issues were identified which have an impact on the effectiveness of the EPA and AHD in achieving the goals of planning for financial and personal and health care in advance of losing capacity.
The data gathered enabled the researchers to achieve the primary goal of the research: to make recommendations to improve the content and useability of the forms which hopefully will lead to an increased uptake and appropriate use of the forms. However, the researchers thought it was important not to ignore broader policy issues that were identified in the course of the research. These broader issues have been highlighted in this Report, and the researchers have responded to them in a variety of ways. For some issues, the researchers have suggested alterations that could be made to the forms to address the particular concerns. For other issues, the researchers have suggested that Government may need to take specific action such as educating the broader community with some attention to strategies that engage particular groups within communities. Other concerns raised can only be dealt with by legislative reform and, in some of these cases, the researchers have identified issues that Government may wish to consider further. We do note, however, that it is beyond the scope of this Report to recommend changes to the law. This three stage mixed methods project aimed to provide systematic evidence from a broad range of stakeholders in regard to: (i) which groups use and do not use these documents and why, (ii) the contribution of the length/complexity/format/language of the forms as barriers to their completion and/or effective use, and (iii) the issues raised by the current documents for witnesses and attorneys. Understanding and use of EPAs and AHDs were generally explored in separate but parallel processes. A purposive sampling strategy included users of the documents as principals and attorneys, and professionals, witnesses and service providers who assist others to execute or use the forms. The first component of this study built on existing knowledge using a Critical Reference Group and material provided by the DJAG Practical Guardianship Initiatives Working Party. 
This assisted in the development of the data collection tools for subsequent stages. The second component comprised semi-structured interviews and focus groups with a targeted sample of current users of the forms, potential users, witnesses and other professionals to provide in-depth information on critical issues. Outreach to Aboriginal and Torres Strait Islander Elders and individuals and workers with CALD groups ensured a broad sample of potential users of the two documents. Fifty individual interviews and three focus groups were completed. Most interviews and focus groups focused on perceptions of, and experiences with, either the EPA or the AHD form. In the interviews with Indigenous people and the CALD focus groups, however, respondents provided their perceptions and experiences of both documents. In general, these respondents had not used the forms and were responding to the documents made available in the interview or focus group. In total, seventy-seven individuals were involved in interviews or focus groups. The final component comprised on-line surveys for EPA principals, EPA attorneys, AHD principals, witnesses of EPAs and AHDs and medical practitioners with experience of AHDs as nominated and/or treating doctors. The surveys were developed from the initial component and the qualitative analysis of the interview and focus group data. A total of 116 surveys were returned from major cities and regional Queensland. The survey data was analysed descriptively for patterns and trends. It is important to note that the aim of the survey was to gain insight into issues and concerns relating to the documents and not to make generalisations to the broader population.
Abstract:
With the growing number of XML documents on the Web it becomes essential to effectively organise these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these types of semi-structured documents due to their heterogeneity and structural irregularity. Most of the existing research on clustering techniques focuses only on one feature of the XML documents, this being either their structure or their content, due to scalability and complexity problems. The knowledge gained in the form of clusters based on the structure or the content alone is not suitable for real-life datasets. It therefore becomes essential to include both the structure and content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both these kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods to utilise frequent pattern mining techniques to reduce the dimension; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information.
The explicit model uses a higher order model, namely a 3-order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose large-sized tensor models to utilise the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in the information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large-scale datasets clearly show the strengths of the proposed methods over state-of-the-art methods. In particular, this thesis work contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it also contributes by addressing the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.
Abstract:
In Bowenbrae Pty Ltd v Flying Fighters Maintenance and Restoration [2010] QDC 347 Reid DCJ made orders requiring the plaintiffs to make application under the Freedom of Information Act 1982 (Cth) (“the FOI Act”) for documents sought by the defendant.
Abstract:
Existing macro level research on the new venture creation process recognises the entrepreneur as a central agent in the process yet generally avoids, at each stage of the process, an examination of the micro level psychological behaviour of the individual entrepreneur. By integrating two theoretical approaches to entrepreneurship research, the psychology of the entrepreneur and the entrepreneurship process, this paper examines, using content analysis, the language used by new venture founders in documents directly linked to their capital raising activity. The study examined the language of 108 offer documents (information memoranda), which were divided between 54 new ventures that were successful in raising capital and 54 new ventures that either did not proceed further or were not successful in raising capital through the Australian Small Scale Offerings Board. Specifically, we were interested in examining the level of optimism evident in these narratives given that entrepreneurs have been previously described in the literature as being excessively optimistic.
Abstract:
Many existing information retrieval models do not explicitly take into account information about word associations. Our approach makes use of first and second order relationships found in natural language, known as syntagmatic and paradigmatic associations, respectively. This is achieved by using a formal model of word meaning within the query expansion process. On ad hoc retrieval, our approach achieves statistically significant improvements in MAP (0.158) and P@20 (0.396) over our baseline model. The ERR@20 and nDCG@20 of our system were 0.249 and 0.192 respectively. Our results and discussion suggest that information about both syntagmatic and paradigmatic associations can assist with improving retrieval effectiveness on ad hoc retrieval.
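For readers unfamiliar with the reported measures, the sketch below shows the standard definitions of precision at k (P@k) and average precision, whose mean over queries gives MAP. The ranking and relevance judgments are hypothetical and have no connection to the system or scores reported in the abstract.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    hits = sum(1 for doc_id in ranking[:k] if doc_id in relevant)
    return hits / k

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant
    document appears, divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Hypothetical ranked list and judgments for a single query.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
p_at_2 = precision_at_k(ranking, relevant, k=2)
ap = average_precision(ranking, relevant)
```

Here `p_at_2` is 0.5 (one relevant document in the top two) and `ap` is (1/1 + 2/3) / 2, about 0.833.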