964 results for Frequent subtrees
Abstract:
Extracting frequent subtrees from tree-structured data has important applications in Web mining. In this paper, we introduce a novel canonical form for rooted labelled unordered trees called the balanced-optimal-search canonical form (BOCF) that can handle the isomorphism problem efficiently. Using BOCF, we define an enumeration approach, based on a tree-structure-guided scheme, that systematically enumerates only the valid subtrees. Finally, we present the balanced optimal search tree miner (BOSTER) algorithm, based on BOCF and the proposed enumeration approach, for finding frequent induced subtrees in a database of labelled rooted unordered trees. Experiments on real datasets compare the efficiency of BOSTER with that of the two state-of-the-art algorithms for mining induced unordered subtrees, HybridTreeMiner and UNI3. The results are encouraging.
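To illustrate why a canonical form helps with isomorphism in unordered trees, here is a minimal sketch. It is not BOCF, whose definition is not given in the abstract; it is the generic "sort the child encodings" canonical string, and the (label, children) tuple representation and the function name `canonical` are assumptions made for illustration.

```python
# A tree is a (label, children) pair; this representation is an assumption.
# Labels are assumed not to contain parentheses.

def canonical(tree):
    """Return a string that is identical for isomorphic rooted unordered labelled trees."""
    label, children = tree
    # Encode each child recursively, then sort so sibling order no longer matters.
    encoded = sorted(canonical(child) for child in children)
    return "(" + label + "".join(encoded) + ")"

# t1 and t2 differ only in sibling order; t3 has a different shape.
t1 = ("a", [("b", []), ("c", [("d", [])])])
t2 = ("a", [("c", [("d", [])]), ("b", [])])
t3 = ("a", [("b", [("d", [])]), ("c", [])])

print(canonical(t1))                   # (a(b)(c(d)))
print(canonical(t1) == canonical(t2))  # True
print(canonical(t1) == canonical(t3))  # False
```

With such a form, two candidate subtrees can be tested for isomorphism by comparing strings, so a miner can count each pattern once regardless of how its siblings happen to be ordered.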
Abstract:
This paper presents an algorithm for mining unordered embedded subtrees using the balanced-optimal-search canonical form (BOCF). An enumeration approach based on a tree-structure-guided scheme is defined using BOCF to systematically enumerate only the valid subtrees. Based on this canonical form and enumeration technique, the balanced optimal search embedded subtree mining algorithm (BEST) is introduced for mining embedded subtrees from a database of labelled rooted unordered trees. Extensive experiments on both synthetic and real datasets demonstrate the efficiency of BEST over the two state-of-the-art algorithms for mining embedded unordered subtrees, SLEUTH and U3.
Abstract:
This paper proposes a novel Hybrid Clustering approach for XML documents (HCX) that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The empirical analysis reveals that the proposed method is scalable and accurate.
Abstract:
XML document clustering is essential for many document handling applications such as information storage, retrieval, integration and transformation. An XML clustering algorithm should process both the structural and the content information of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. This paper introduces a novel approach that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The proposed method reduces the high dimensionality of input data by using only the structure-constrained content. The empirical analysis reveals that the proposed method can effectively cluster even very large XML datasets and outperform other existing methods.
Abstract:
This paper presents an overview of the experiments conducted using the Hybrid Clustering of XML documents using Constraints (HCXC) method for the clustering task in the INEX 2009 XML Mining track. This technique utilises frequent subtrees generated from the structure to extract the content for clustering the XML documents. The paper also presents an experimental study using several data representations: structure-only, content-only, and both the structure and the content of XML documents. Unlike previous years, this year the XML documents were marked up using Wiki tags and contain categories derived using the YAGO ontology. This paper also presents the results of studying the effect of these tags on XML clustering using the HCXC method.
Abstract:
With the growing number of XML documents on the Web, it becomes essential to effectively organise these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these types of semi-structured documents due to their heterogeneity and structural irregularity. Most of the existing research on clustering techniques focuses on only one feature of the XML documents, either their structure or their content, due to scalability and complexity problems. The knowledge gained in the form of clusters based on the structure or the content alone is not suitable for real-life datasets. It therefore becomes essential to include both the structure and content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both these kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods to utilise frequent pattern mining techniques to reduce the dimension; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information.
The explicit model uses a higher order model, namely a 3-order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose large-sized tensor models and to utilise the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large-scale datasets clearly show the strengths of the proposed methods over state-of-the-art methods. In particular, this thesis work contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it also contributes by addressing the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.
Abstract:
We present a method to enhance fault localization for software systems based on a frequent pattern mining algorithm. Our method is based on a large set of test cases for a given set of programs in which faults can be detected. The test executions are recorded as function call trees. Based on test oracles the tests can be classified into successful and failing tests. A frequent pattern mining algorithm is used to identify frequent subtrees in successful and failing test executions. This information is used to rank functions according to their likelihood of containing a fault. The ranking suggests an order in which to examine the functions during fault analysis. We validate our approach experimentally using a subset of Siemens benchmark programs.
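The ranking idea can be sketched in a deliberately simplified form. The sketch below is not the authors' algorithm: it replaces frequent subtree mining over call trees with the simplest possible pattern (a single function appearing in a test execution) and scores each function by its support in failing runs minus its support in passing runs. The function name `rank_functions`, the set-based trace representation and the example traces are all hypothetical.

```python
from collections import Counter

def rank_functions(passing, failing):
    """Rank functions by (support in failing runs) - (support in passing runs).

    passing, failing: lists of sets, each set holding the function names
    observed in one test execution (an assumed, flattened trace format).
    """
    fail_count = Counter(f for trace in failing for f in trace)
    pass_count = Counter(f for trace in passing for f in trace)
    scores = {}
    for f in set(fail_count) | set(pass_count):
        fail_support = fail_count[f] / len(failing) if failing else 0.0
        pass_support = pass_count[f] / len(passing) if passing else 0.0
        scores[f] = fail_support - pass_support
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda f: (-scores[f], f))

passing = [{"main", "parse"}, {"main", "parse", "emit"}]
failing = [{"main", "parse", "fold"}, {"main", "fold", "emit"}]
ranking = rank_functions(passing, failing)
print(ranking)  # ['fold', 'emit', 'main', 'parse']
```

Here `fold` appears in every failing run and in no passing run, so it is examined first; a subtree-based approach would additionally take the calling context into account.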
Abstract:
Web data can often be represented in free tree form; however, few free tree mining methods exist. In this paper, a computationally fast algorithm, FreeS, is presented to discover all frequently occurring free subtrees in a database of labelled free trees. FreeS is designed using an optimal canonical form, BOCF, that can uniquely represent free trees even in the presence of isomorphism. To avoid enumerating false positive candidates, it utilises an enumeration approach based on a tree-structure-guided scheme. This paper presents lemmas that introduce conditions which the generation of free tree candidates must conform to during enumeration. An empirical study using both real and synthetic datasets shows that FreeS is scalable and significantly outperforms (i.e. is a few orders of magnitude faster than) the state-of-the-art frequent free tree mining algorithms, HybridTreeMiner and FreeTreeMiner.
Abstract:
In this paper, we discuss our participation in the INEX 2008 Link-the-Wiki track. We utilized a sliding-window-based algorithm to extract frequent terms and phrases. Using the extracted phrases and terms as descriptive vectors, the anchors and relevant links (both incoming and outgoing) are recognized efficiently.
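A sliding-window extraction of frequent terms and phrases might look like the sketch below. The abstract does not specify the algorithm, so this is an assumption: windows of 1 to `max_len` tokens slide over whitespace-separated text, and every window seen at least `min_count` times is kept. The function name, parameters and sample sentence are hypothetical.

```python
from collections import Counter

def frequent_phrases(text, max_len=3, min_count=2):
    """Count every window of 1..max_len consecutive tokens and keep the frequent ones."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_len + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}

sample = "frequent subtrees help cluster xml documents and frequent subtrees scale"
result = frequent_phrases(sample)
print(result)  # {'frequent': 2, 'subtrees': 2, 'frequent subtrees': 2}
```

The surviving terms and phrases could then serve as the dimensions of a descriptive vector for each page.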
Abstract:
Endometrial carcinoma is the most common gynecological malignancy in the United States. Although most women present with early disease confined to the uterus, the majority of persistent or recurrent tumors are refractory to current chemotherapies. We have identified a total of 11 different FGFR2 mutations in 3/10 (30%) of endometrial cell lines and 19/187 (10%) of primary uterine tumors. Mutations were seen primarily in tumors of the endometrioid histologic subtype (18/115 cases investigated, 16%). The majority of the somatic mutations identified were identical to germline activating mutations in FGFR2 and FGFR3 that cause Apert Syndrome, Beare-Stevenson Syndrome, hypochondroplasia, achondroplasia and SADDAN syndrome. The two most common somatic mutations identified were S252W (in eight tumors) and N550K (in five samples). Four novel mutations were identified, three of which are also likely to result in receptor gain-of-function. Extensive functional analyses have already been performed on many of these mutations, demonstrating they result in receptor activation through a variety of mechanisms. The discovery of activating FGFR2 mutations in endometrial carcinoma raises the possibility of employing anti-FGFR molecularly targeted therapies in patients with advanced or recurrent endometrial carcinoma.
Abstract:
Using a genome-scanning approach to search for oncogenes, a recent report identifies somatic mutations in the signaling gene BRAF that are particularly prevalent in melanoma.
Abstract:
Background & aims: The Australasian Nutrition Care Day Survey (ANCDS) ascertained whether malnutrition and poor food intake are independent risk factors for health-related outcomes in Australian and New Zealand hospital patients. Methods: Phase 1 recorded nutritional status (Subjective Global Assessment) and 24-h food intake (0, 25, 50, 75, 100% intake). Outcomes data (Phase 2) were collected 90 days post-Phase 1 and included length of hospital stay (LOS), readmissions and in-hospital mortality. Results: Of 3122 participants (47% females, 65 ± 18 years) from 56 hospitals, 32% were malnourished and 23% consumed ≤ 25% of the offered food. Malnourished patients had greater median LOS (15 days vs. 10 days, p < 0.0001) and readmission rates (36% vs. 30%, p = 0.001). Median LOS for patients consuming ≤ 25% of the food was higher than for those consuming ≤ 50% (13 vs. 11 days, p < 0.0001). The odds of 90-day in-hospital mortality were twice as great for malnourished patients (CI: 1.09–3.34, p = 0.023) and for those consuming ≤ 25% of the offered food (CI: 1.13–3.51, p = 0.017). Conclusion: The ANCDS establishes that malnutrition and poor food intake are independently associated with in-hospital mortality in the Australian and New Zealand acute care setting.
Abstract:
What do we know?
• Customer experience is increasingly becoming the new standard for differentiation in both offline and online retailing, and offers a sustainable competitive advantage.
  o The economic value of a company’s offering has been observed to increase when the customer has a fulfilling shopping experience (Pine & Gilmore, 1998).
  o Crafting an engaging customer experience is a known method of generating loyalty, advocacy and word of mouth (Tynan & McKechnie, 2009).
  o A good experience can entice consumers to shop for longer and spend more (Kim, 2001).
• The customer’s experience is made up of diverse elements occurring before, during and after the purchase itself. (Discussed further on page 5.) It is cumulative over time and can be influenced by touch points across multiple channels.
What remains unclear?
• How do Coles customers respond to the elements of online customer experience?
• How does the online customer experience differ for frequent and infrequent purchasers?
• Do differences in online customer experience exist between genders and age cohorts?
Abstract:
Purpose: According to frustration theory, customer frustration incidents lead to frustration behavior such as protest (negative word‐of‐mouth). On the internet, customers can express their emotions verbally and non‐verbally on numerous web‐based review platforms. The purpose of this study is to investigate online dysfunctional customer behavior, in particular negative “word‐of‐web” (WOW) in online feedback forums, among customers who participate in frequent‐flier programs in the airline industry. Design/methodology/approach: The study employs a variation of the critical incident technique (CIT) referred to as the critical internet feedback technique (CIFT). Qualitative data from customer reviews of 13 different frequent‐flier programs posted on the internet were collected and analyzed with regard to frustration incidents, verbal and non‐verbal emotional effects and types of dysfunctional word‐of‐web customer behavior. The sample includes 141 negative customer reviews based on non‐recommendations and low program ratings. Findings: Problems with loyalty programs evoke negative emotions that are expressed in a spectrum of verbal and non‐verbal negative electronic word‐of‐mouth. Online dysfunctional behavior can vary widely, from low ratings and non‐recommendations to voicing switching intentions to even stronger forms such as manipulation of others and revenge intentions. Research limitations/implications: Results have to be viewed carefully due to methodological challenges with regard to the measurement of emotions, in particular the accuracy of self‐report techniques and the quality of online data. Generalization of the results is limited because the study utilizes data from only one industry. Further research is needed with regard to the exact differentiation of frustration from related constructs. In addition, large‐scale quantitative studies are necessary to specify and test the relationships between frustration incidents and subsequent dysfunctional customer behavior expressed in negative word‐of‐web. Practical implications: The study yields important implications for the monitoring of the perceived quality of loyalty programs. Management can obtain valuable information about program‐related and/or relationship‐related frustration incidents that lead to online dysfunctional customer behavior. A proactive response strategy should be developed to deal with severe cases, such as sabotage plans. Originality/value: This study contributes to knowledge regarding the limited research on online dysfunctional customer behavior as well as frustration incidents of loyalty programs. Also, the article presents a theoretical “customer frustration‐defection” framework that describes different levels of online dysfunctional behavior in relation to the level of frustration sensation that customers have experienced. The framework extends the existing perspective of the “customer satisfaction‐loyalty” framework developed by Heskett et al.