125 results for Association rule mining
in University of Queensland eSpace - Australia
Abstract:
Data mining is the process of identifying valid, implicit, previously unknown, potentially useful, and understandable information in large databases. It is an important step in the process of knowledge discovery in databases (Olaru & Wehenkel, 1999). In a data mining process, input data can be structured, semi-structured, or unstructured, and can take the form of text, categorical values, or numerical values. One important characteristic of data mining is its ability to deal with data that are large in volume, distributed, time-variant, noisy, and high-dimensional. A large number of data mining algorithms have been developed for different applications. For example, association rule mining can be useful for market basket problems, clustering algorithms can be used to discover trends in unsupervised learning problems, classification algorithms can be applied in decision-making problems, and sequential and time series mining algorithms can be used in predicting events, fault detection, and other supervised learning problems (Vapnik, 1999). Classification is among the most important tasks in data mining, particularly for applications in engineering fields. Together with regression, classification is mainly used for predictive modelling. A number of classification algorithms are in practical use. According to Sebastiani (2002), the main classification algorithms can be categorized as: decision tree and rule-based approaches such as C4.5 (Quinlan, 1996); probabilistic methods such as the Bayesian classifier (Lewis, 1998); on-line methods such as Winnow (Littlestone, 1988) and CVFDT (Hulten, 2001); neural network methods (Rumelhart, Hinton & Williams, 1986); example-based methods such as k-nearest neighbors (Duda & Hart, 1973); and SVM (Cortes & Vapnik, 1995). Other important techniques for classification tasks include Associative Classification (Liu et al., 1998) and Ensemble Classification (Tumer, 1996).
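The market basket use of association rules mentioned in this abstract rests on two standard measures, support and confidence. The following is a minimal Python sketch of both; the function names and the toy baskets are illustrative assumptions and do not come from the abstract.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    antecedent = frozenset(antecedent)
    both = support(transactions, antecedent | frozenset(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0


# Hypothetical toy baskets, invented purely for illustration.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

bread_milk_support = support(baskets, {"bread", "milk"})
bread_to_milk_confidence = confidence(baskets, {"bread"}, {"milk"})
```

A rule is usually reported only when both values exceed user-chosen thresholds, which is what keeps the number of mined rules manageable.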
Abstract:
Web transaction data between Web visitors and Web functionalities usually convey users' task-oriented behavior patterns. Mining this type of click-stream data makes it possible to capture usage pattern information. Web usage mining has become one of the most widely used methods for Web recommendation, which customizes Web content to the user's preferred style. Traditional Web usage mining techniques, such as Web user session or Web page clustering, association rule mining, and frequent navigational path mining, can only discover usage patterns explicitly. They cannot, however, reveal the underlying navigational activities or identify the latent relationships associated with the patterns among Web users and Web pages. In this work, we propose a Web recommendation framework that incorporates a Web usage mining technique based on the Probabilistic Latent Semantic Analysis (PLSA) model. The main advantage of this method is that it not only discovers usage-based access patterns but also reveals the underlying latent factors. Using the discovered user access patterns, we then present users with content of greater interest via collaborative recommendation. To validate the effectiveness of the proposed approach, we conduct experiments on real-world datasets and make comparisons with some existing traditional techniques. The preliminary experimental results demonstrate the usability of the proposed approach.
Abstract:
The principle of using induction rules based on spatial environmental data to model a soil map has previously been demonstrated. Whilst the general pattern of classes of large spatial extent and those with close association with geology were delineated, small classes and the detailed spatial pattern of the map were less well rendered. Here we examine several strategies to improve the quality of the soil map models generated by rule induction. Terrain attributes that are better suited to landscape description at a resolution of 250 m are introduced as predictors of soil type. A map sampling strategy is developed. Classification error is reduced by using boosting rather than cross validation to improve the model. Further, the benefit of incorporating the local spatial context for each environmental variable into the rule induction is examined. The best model was achieved by sampling in proportion to the spatial extent of the mapped classes, boosting the decision trees, and using spatial contextual information extracted from the environmental variables.
Abstract:
Instantaneous outbursts in underground coal mines have occurred in at least 16 countries, involving both methane (CH4) and carbon dioxide (CO2). The precise mechanisms of an instantaneous outburst are still unresolved but must consider the effects of stress, gas content and physico-mechanical properties of the coal. Other factors such as mining methods (e.g., development heading into the coal seam) and geological features (e.g., coal seam disruptions from faulting) can combine to exacerbate the problem. Prediction techniques continue to be unreliable, and unexpected outburst incidents resulting in fatalities are a major concern for underground coal operations. Gas content thresholds of 9 m³/t for CH4 and 6 m³/t for CO2 are used in the Sydney Basin to indicate outburst-prone conditions, but are reviewed on an individual mine basis and in mixed gas situations. Data on the sorption behaviour of Bowen Basin coals from Australia have provided an explanation for the conflicting results obtained by coal face desorption indices used for outburst-proneness assessment. A key factor appears to be the different desorption rates displayed by banded coals, which is supported by both laboratory and mine-site investigations. Dull coal bands with high fusinite and semifusinite contents tend to display rapid desorption from solid coal, for a given pressure drop. The opposite is true for bright coal bands with high vitrinite contents and dull coal bands with high inertodetrinite contents. Consequently, when face samples of dull, fusinite- or semifusinite-rich coal of small particle size are taken for desorption testing, much gas has already escaped and low readings result. The converse applies for samples taken from coal bands with high vitrinite and/or inertodetrinite contents. In terms of outburst potential, it is the bright, vitrinite-rich and the dull, inertodetrinite-rich sections of a coal seam that appear to be more outburst-prone.
This is due to the ability of the solid coal to retain gas, even after pressure reduction, creating a gas content gradient across the coal face sufficient to initiate an outburst. Once the particle size of the coal is reduced, rapid gas desorption can then take place. (C) 1998 Elsevier Science.
Abstract:
Epidemiological studies suggest that ovarian cancer is an endocrine-related tumour, and progesterone exposure specifically may decrease the risk of ovarian cancer. To assess whether the progesterone receptor (PR) exon 4 valine to leucine amino acid variant is associated with specific tumour characteristics or with overall risk of ovarian cancer, we examined 551 cases of epithelial ovarian cancer and 298 unaffected controls for the underlying G→T nucleotide substitution polymorphism. Stratification of the ovarian cancer cases according to tumour behaviour (low malignant potential or invasive), histology, grade or stage failed to reveal any heterogeneity with respect to the genotype defined by the PR exon 4 polymorphism. Furthermore, the genotype distribution did not differ significantly between ovarian cancer cases and unaffected controls. Compared with the GG genotype, the age-adjusted odds ratio (95% confidence interval) for risk of ovarian cancer was 0.78 (0.57-1.08) for the GT genotype, and 1.39 (0.47-4.14) for the TT genotype. In conclusion, the PR exon 4 codon 660 leucine variant encoded by the T allele does not appear to be associated with ovarian tumour behaviour, histology, stage or grade. This variant is also not associated with an increased risk of ovarian cancer, and is unlikely to be associated with a large decrease in ovarian cancer risk, although we cannot rule out a moderate inverse association between the GT genotype and ovarian cancer.
Abstract:
Patterns of association of digenean families and their mollusc and vertebrate hosts are assessed by way of a new database containing information on over 1000 species of digeneans for life-cycles and over 5000 species from fishes. Analysis of the distribution of digenean families in molluscs suggests that the group was associated primitively with gastropods and that infections of polychaetes, bivalves and scaphopods are all the results of host-switching. For the vertebrates, infections of agnathans and chondrichthyans are apparently the result of host-switching from teleosts. For digenean families the ratio of orders of fishes infected to superfamilies of molluscs infected ranges from 0.5 (Mesometridae) to 16 (Bivesiculidae) and has a mean of 5.6. Individual patterns of host association of 13 digenean families and superfamilies are reviewed. Two, Bucephalidae and Sanguinicolidae, are exceptional in infecting a range of first intermediate hosts qualitatively as broad as their range of definitive hosts. No well-studied taxon shows narrower association with vertebrate than with mollusc clades. The range of definitive hosts of digeneans is characteristically defined by eco-physiological similarity rather than phylogenetic relationship. The range of associations of digenean families with mollusc taxa is generally much narrower. These data are considered in the light of ideas about the significance of different forms of host association. If Manter's Second Rule (the longer the association with a host group, the more pronounced the specificity exhibited by the parasite group) is invoked, then the data may suggest that the Digenea first parasitised molluscs before adopting vertebrate hosts. This interpretation is consistent with most previous ideas about the evolution of the Digenea but contrary to current interpretations based on the monophyly of the Neodermata. The basis of Manter's Second Rule is, however, considered too flimsy for this interpretation to be robust.
Problems of the inference of the evolution of patterns of parasitism in the Neodermata are discussed and considered so intractable that the truth may be presently unknowable. (C) 2001 Australian Society for Parasitology Inc. Published by Elsevier Science Ltd. All rights reserved.
Abstract:
A biologically realizable, unsupervised learning rule is described for the online extraction of object features, suitable for solving a range of object recognition tasks. Alterations to the basic learning rule are proposed which allow the rule to better suit the parameters of a given input space. One negative consequence of such modifications is the potential for learning instability. The criteria for such instability are modeled using digital filtering techniques, and the predicted regions of stability and instability are tested. The result is a family of learning rules which can be tailored to the specific environment, improving both convergence times and accuracy over the standard learning rule, while simultaneously ensuring learning stability.
Abstract:
Frequent itemset mining is well explored for various data types, and its computational complexity is well understood. There are methods that deal effectively with the computational problems. This paper presents another approach to further enhancing the performance of frequent itemset computation. We have made a series of observations that led us to invent data pre-processing methods such that the final step of the Partition algorithm, in which the combination of all local candidate sets must be processed, is executed on substantially smaller input data. The paper reports results from several experiments that confirm our general and formally presented observations.
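For orientation, the two phases of the baseline Partition algorithm that this abstract builds on can be sketched in Python as follows. This is a simplified illustration that assumes brute-force enumeration inside each partition, which is workable only for toy data. It does not implement the pre-processing methods the paper proposes, and all function names and the sample transactions are hypothetical.

```python
from itertools import combinations
from math import ceil
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]


def local_frequent(partition: List[Itemset], min_support: float) -> Set[Itemset]:
    """Itemsets that are frequent within a single partition (brute force)."""
    min_count = ceil(min_support * len(partition))
    counts: Dict[Itemset, int] = {}
    for transaction in partition:
        items = sorted(transaction)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}


def partition_mine(
    transactions: List[Itemset], min_support: float, n_partitions: int
) -> Dict[Itemset, int]:
    """Two-phase Partition sketch: local mining, then one global counting pass."""
    if not transactions:
        return {}
    size = ceil(len(transactions) / n_partitions)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Phase 1: the union of locally frequent itemsets forms the global candidates.
    # A globally frequent itemset must be locally frequent in at least one partition.
    candidates: Set[Itemset] = set()
    for part in parts:
        candidates |= local_frequent(part, min_support)

    # Phase 2 (the final step the abstract refers to): count every candidate
    # against the whole database and keep those that meet the global threshold.
    min_count = ceil(min_support * len(transactions))
    result: Dict[Itemset, int] = {}
    for cand in candidates:
        count = sum(1 for t in transactions if cand <= t)
        if count >= min_count:
            result[cand] = count
    return result


# Hypothetical toy database, invented for illustration.
db = [frozenset(t) for t in (
    {"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"a", "b"}, {"c"},
)]
frequent = partition_mine(db, min_support=0.5, n_partitions=2)
```

In this toy run the itemset {a, c} is locally frequent in the first partition but fails the global count, which shows why the second pass is needed and why shrinking its input is worthwhile.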
Abstract:
Objective: An estimation of cut-off points for the diagnosis of diabetes mellitus (DM) based on individual risk factors. Methods: A subset of the 1991 Oman National Diabetes Survey is used, including all patients with a 2-h post-glucose load >= 200 mg/dl (278 subjects) and a control group of 286 subjects. All subjects previously diagnosed as diabetic and all subjects with missing data values were excluded. The data set was analyzed using the SPSS Clementine data mining system. Decision tree learners (C5 and CART) and a method for mining association rules (the GRI algorithm) are used. The fasting plasma glucose (FPG), age, sex, family history of diabetes and body mass index (BMI) are the input risk factors (independent variables), while diabetes onset (the 2-h post-glucose load >= 200 mg/dl) is the output (dependent variable). All three techniques were tested by cross-validation (89.8%). Results: Rules produced for diabetes diagnosis are: A- GRI algorithm (1) FPG>=108.9 mg/dl, (2) FPG>=107.1 and age>39.5 years. B- CART decision trees: FPG >=110.7 mg/dl. C- The C5 decision tree learner: (1) FPG>=95.5 and 54, (2) FPG>=106 and 25.2 kg/m2. (3) FPG>=106 and =133 mg/dl. The three techniques produced rules which cover a significant number of cases (82%), with confidence between 74 and 100%. Conclusion: Our approach supports the suggestion that the present cut-off value of fasting plasma glucose (126 mg/dl) for the diagnosis of diabetes mellitus needs revision, and that individual risk factors such as age and BMI should be considered in defining the new cut-off value.
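The GRI and CART rules reported in this abstract are simple threshold tests, so they can be written directly as predicates. The Python sketch below is an illustration only: the thresholds are taken from the abstract, while the function names, the definition of coverage as the fraction of records on which a rule fires, and the sample records are assumptions made for the example. The C5 rules are omitted because their conditions are incomplete in the text above.

```python
from typing import Callable, List, Tuple

# Each record is (FPG in mg/dl, age in years, diabetic?). The values are
# hypothetical and are not taken from the Oman survey.
Record = Tuple[float, float, bool]


def gri_rule(fpg: float, age: float) -> bool:
    """GRI rules (1) and (2) from the abstract, combined with a logical OR."""
    return fpg >= 108.9 or (fpg >= 107.1 and age > 39.5)


def cart_rule(fpg: float) -> bool:
    """CART cut-off from the abstract."""
    return fpg >= 110.7


def rule_stats(
    records: List[Record], rule: Callable[[float, float], bool]
) -> Tuple[float, float]:
    """Return (coverage, confidence) of a rule over labelled records.

    Coverage is assumed here to be the share of records on which the rule
    fires; confidence is the share of those records that are diabetic.
    """
    fired = [r for r in records if rule(r[0], r[1])]
    if not fired:
        return 0.0, 0.0
    coverage = len(fired) / len(records)
    confidence = sum(1 for r in fired if r[2]) / len(fired)
    return coverage, confidence


sample: List[Record] = [
    (112.0, 50, True),
    (108.0, 45, True),
    (108.0, 30, False),
    (95.0, 60, False),
    (120.0, 35, False),
]
gri_coverage, gri_confidence = rule_stats(sample, gri_rule)
```

On these invented records the GRI predicate fires on three of five cases, two of which are labelled diabetic; real figures would have to come from the survey data.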