43 results for Classification Rules
in University of Queensland eSpace - Australia
Abstract:
The classification rules of linear discriminant analysis are defined by the true mean vectors and the common covariance matrix of the populations from which the data come. Because these true parameters are generally unknown, they are commonly estimated by the sample mean vector and covariance matrix of the data in a training sample randomly drawn from each population. However, these sample statistics are notoriously susceptible to contamination by outliers, a problem compounded by the fact that the outliers may be invisible to conventional diagnostics. High-breakdown estimation is a procedure designed to remove this cause for concern by producing estimates that are immune to serious distortion by a minority of outliers, regardless of their severity. In this article we motivate and develop a high-breakdown criterion for linear discriminant analysis and give an algorithm for its implementation. The procedure is intended to supplement rather than replace the usual sample-moment methodology of discriminant analysis either by providing indications that the dataset is not seriously affected by outliers (supporting the usual analysis) or by identifying apparently aberrant points and giving resistant estimators that are not affected by them.
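To make the idea concrete, below is a minimal Python sketch of a high-breakdown variant of linear discriminant analysis in which the sample mean and covariance are replaced by a robust Minimum Covariance Determinant estimate (scikit-learn's MinCovDet is assumed as the robust estimator). This illustrates the general approach only, not the criterion or algorithm developed in the article.

# A minimal sketch of high-breakdown discriminant analysis: the usual
# sample moments are replaced by a robust Minimum Covariance Determinant
# (MCD) estimate.  This illustrates the general idea only, not the
# specific criterion or algorithm developed in the article.
import numpy as np
from sklearn.covariance import MinCovDet

def robust_lda_fit(X, y):
    """Robust class means and pooled robust covariance (equal priors assumed)."""
    classes = np.unique(y)
    means, weighted_covs, n_total = {}, [], 0
    for c in classes:
        Xc = X[y == c]
        mcd = MinCovDet().fit(Xc)            # high-breakdown location/scatter
        means[c] = mcd.location_
        weighted_covs.append(mcd.covariance_ * len(Xc))
        n_total += len(Xc)
    pooled = sum(weighted_covs) / (n_total - len(classes))
    return means, np.linalg.inv(pooled), classes

def robust_lda_predict(X, means, precision, classes):
    """Assign each observation to the class with the largest linear score."""
    scores = np.column_stack([
        X @ precision @ means[c] - 0.5 * means[c] @ precision @ means[c]
        for c in classes
    ])
    return classes[np.argmax(scores, axis=1)]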
Abstract:
Most of the modern developments with classification trees are aimed at improving their predictive capacity. This article considers a curiously neglected aspect of classification trees, namely the reliability of predictions that come from a given classification tree. In the sense that a node of a tree represents a point in the predictor space in the limit, the aim of this article is the development of localized assessment of the reliability of prediction rules. A classification tree may be used either to provide a probability forecast, where for each node the membership probabilities for each class constitute the prediction, or a true classification, where each new observation is predictively assigned to a unique class. Correspondingly, two types of reliability measure will be derived, namely prediction reliability and classification reliability. We use bootstrapping methods as the main tool to construct these measures. We also provide a suite of graphical displays by which they may be easily appreciated. In addition to providing some estimate of the reliability of specific forecasts of each type, these measures can also be used to guide future data collection to improve the effectiveness of the tree model. The motivating example we give has a binary response, namely the presence or absence of a species of Eucalypt, Eucalyptus cloeziana, at a given sampling location in response to a suite of environmental covariates (although the methods are not restricted to binary response data).
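The bootstrap idea can be sketched roughly as follows: refit the tree on bootstrap resamples of the training data and summarise how stable its probability forecasts (prediction reliability) and class assignments (classification reliability) are at a query point. The helper below is a generic illustration using a standard scikit-learn tree, not the authors' exact procedure or graphical displays.

# Bootstrap-based reliability at a query point: a generic illustration,
# not the authors' exact procedure.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def tree_reliability(X, y, x_new, n_boot=200, max_depth=3, random_state=0):
    rng = np.random.RandomState(random_state)
    classes = np.unique(y)
    probas, modal = [], []
    for _ in range(n_boot):
        Xb, yb = resample(X, y, random_state=rng)
        tree = DecisionTreeClassifier(max_depth=max_depth).fit(Xb, yb)
        p = np.zeros(len(classes))
        p[np.searchsorted(classes, tree.classes_)] = tree.predict_proba(
            x_new.reshape(1, -1))[0]
        probas.append(p)
        modal.append(np.argmax(p))
    probas = np.array(probas)
    prediction_reliability = probas.std(axis=0)        # spread of the forecasts
    classification_reliability = np.bincount(
        modal, minlength=len(classes)).max() / n_boot  # agreement of class labels
    return probas.mean(axis=0), prediction_reliability, classification_reliability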
Abstract:
Systematic protocols that use decision rules or scores are seen to improve consistency and transparency in classifying the conservation status of species. When applying these protocols, assessors are typically required to decide on estimates for attributes that are inherently uncertain. Input data and resulting classifications are usually treated as though they are exact and hence without operator error. We investigated the impact of data interpretation on the consistency of protocols of extinction risk classifications and diagnosed causes of discrepancies when they occurred. We tested three widely used systematic classification protocols employed by the World Conservation Union, NatureServe, and the Florida Fish and Wildlife Conservation Commission. We provided 18 assessors with identical information for 13 different species to infer estimates for each of the required parameters for the three protocols. The threat classification of several of the species varied from low risk to high risk, depending on who did the assessment. This occurred across the three protocols investigated. Assessors tended to agree on their placement of species in the highest (50-70%) and lowest risk categories (20-40%), but there was poor agreement on which species should be placed in the intermediate categories. Furthermore, the correspondence between the three classification methods was unpredictable, with large variation among assessors. These results highlight the importance of peer review and consensus among multiple assessors in species classifications and the need to be cautious with assessments carried out by a single assessor. Greater consistency among assessors requires wide use of training manuals and formal methods for estimating parameters that allow uncertainties to be represented, carried through chains of calculations, and reported transparently.
Mixed-state entanglement in the light of pure-state entanglement constrained by superselection rules
Abstract:
We show that the classification of bi-partite pure entangled states when local quantum operations are restricted, e.g., constrained by local superselection rules, yields a structure that is analogous in many respects to that of mixed-state entanglement, including such exotic phenomena as bound entanglement and activation. This analogy aids in resolving several conceptual puzzles in the study of entanglement under restricted operations. Specifically, we demonstrate that several types of quantum optical states that possess confusing entanglement properties are analogous to bound entangled states. Also, the classification of pure-state entanglement under restricted operations can be much simpler than that of mixed-state entanglement. For instance, in the case of local Abelian superselection rules, all questions concerning distillability can be resolved.
Abstract:
Developing a unified classification system to replace four of the systems currently used in disability athletics (i.e., track and field) has been widely advocated. The diverse impairments to be included in a unified system require several assessment methods, the results of which cannot be meaningfully compared. Therefore, the taxonomic basis of current classification systems is invalid in a unified system. Biomechanical analysis establishes that force, a vector described in terms of magnitude and direction, is a key determinant of success in all athletic disciplines. It is posited that all impairments to be included in a unified system may be classified as either force magnitude impairments (FMI) or force control impairments (FCI). This framework would provide a valid taxonomic basis for a unified system, creating the opportunity to decrease the number of classes and enhance the viability of disability athletics.
Abstract:
In the context of cancer diagnosis and treatment, we consider the problem of constructing an accurate prediction rule on the basis of a relatively small number of tumor tissue samples of known type containing the expression data on very many (possibly thousands of) genes. Recently, results have been presented in the literature suggesting that it is possible to construct a prediction rule from only a few genes such that it has a negligible prediction error rate. However, in these results the test error or the leave-one-out cross-validated error is calculated without allowance for the selection bias. There is no allowance because the rule is either tested on tissue samples that were used in the first instance to select the genes being used in the rule or because the cross-validation of the rule is not external to the selection process; that is, gene selection is not performed in training the rule at each stage of the cross-validation process. We describe how in practice the selection bias can be assessed and corrected for by either performing a cross-validation or applying the bootstrap external to the selection process. We recommend using 10-fold rather than leave-one-out cross-validation, and concerning the bootstrap, we suggest using the so-called .632+ bootstrap error estimate designed to handle overfitted prediction rules. Using two published data sets, we demonstrate that when correction is made for the selection bias, the cross-validated error is no longer zero for a subset of only a few genes.
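The correction described above can be illustrated with a short sketch in which gene selection is refit inside every cross-validation fold (external CV) instead of once on the full data set. The univariate F-test selector and LDA classifier below are generic stand-ins, not the particular prediction rules studied in the article.

# External cross-validation: gene selection happens inside each fold.
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def external_cv_error(X, y, n_genes=10, n_folds=10):
    """10-fold CV error with gene selection performed per training fold."""
    model = make_pipeline(
        SelectKBest(f_classif, k=n_genes),   # selection refit within each fold
        LinearDiscriminantAnalysis(),
    )
    return 1.0 - cross_val_score(model, X, y, cv=n_folds).mean()

# The biased (internal) alternative would select the genes once on all of X
# and then cross-validate only the classifier on the reduced matrix.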
Abstract:
Data mining is the process of identifying valid, implicit, previously unknown, potentially useful, and understandable information from large databases. It is an important step in the process of knowledge discovery in databases (Olaru & Wehenkel, 1999). In a data mining process, input data can be structured, semi-structured, or unstructured. Data can be in text, categorical, or numerical values. One of the important characteristics of data mining is its ability to deal with data that are large in volume, distributed, time variant, noisy, and high dimensional. A large number of data mining algorithms have been developed for different applications. For example, association rule mining can be useful for market basket problems, clustering algorithms can be used to discover trends in unsupervised learning problems, classification algorithms can be applied in decision-making problems, and sequential and time series mining algorithms can be used in predicting events, fault detection, and other supervised learning problems (Vapnik, 1999). Classification is among the most important tasks in data mining, particularly for data mining applications in engineering fields. Together with regression, classification is mainly used for predictive modelling. A number of classification algorithms are now in practical use. According to Sebastiani (2002), the main classification algorithms can be categorized as: decision tree and rule-based approaches such as C4.5 (Quinlan, 1996); probability methods such as the Bayesian classifier (Lewis, 1998); on-line methods such as Winnow (Littlestone, 1988) and CVFDT (Hulten, 2001); neural network methods (Rumelhart, Hinton & Williams, 1986); and example-based methods such as k-nearest neighbors (Duda & Hart, 1973) and SVM (Cortes & Vapnik, 1995). Other important techniques for classification tasks include Associative Classification (Liu et al., 1998) and Ensemble Classification (Tumer, 1996).
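As a minimal, runnable example of one of the algorithm families listed above (example-based methods, here k-nearest neighbors), the following scikit-learn snippet can be assumed; the iris data set is a placeholder, not a data set discussed in the text.

# k-nearest neighbors classification on a placeholder data set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))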
Abstract:
The graded-fermion algebra and quasispin formalism are introduced and applied to obtain the gl(m|n) ↓ osp(m|n) branching rules for the two-column tensor irreducible representations of gl(m|n), for the case m ≤ n (n > 2). In the case m < n, all such irreducible representations of gl(m|n) are shown to be completely reducible as representations of osp(m|n). This is also shown to be true for the case m = n, except for the spin-singlet representations, which contain an indecomposable representation of osp(m|n) with composition length 3. These branching rules are given in fully explicit form. (C) 1999 American Institute of Physics. [S0022-2488(99)04410-2]
Abstract:
Strain-dependent hydraulic conductivities are uniquely defined by an environmental factor, representing applied normal and shear strains, combined with intrinsic material parameters representing mass and component deformation moduli, initial conductivities, and mass structure. The components representing mass moduli and structure are defined in terms of RQD (rock quality designation) and RMR (rock mass rating) to represent the response of a whole spectrum of rock masses, varying from highly fractured (crushed) rock to intact rock. These two empirical parameters determine the hydraulic response of a fractured medium to the induced deformations. The constitutive relations are verified against available published data and applied to study one-dimensional, strain-dependent fluid flow. Analytical results indicate that both normal and shear strains exert a significant influence on the processes of fluid flow and that the magnitude of this influence is regulated by the values of RQD and RMR.
Abstract:
Examples from the Murray-Darling basin in Australia are used to illustrate different methods of disaggregation of reconnaissance-scale maps. One approach to disaggregation revolves around the de-convolution of the soil-landscape paradigm elaborated during a soil survey. The descriptions of soil map units and block diagrams in a soil survey report detail soil-landscape relationships or soil toposequences that can be used to disaggregate map units into component landscape elements. Toposequences can be visualised on a computer by combining soil maps with digital elevation data. Expert knowledge or statistics can be used to implement the disaggregation. Use of a restructuring element and k-means clustering are illustrated. Another approach to disaggregation uses training areas to develop rules to extrapolate detailed mapping into other, larger areas where detailed mapping is unavailable. A two-level decision tree example is presented. At one level, the decision tree method is used to capture mapping rules from the training area; at another level, it is used to define the domain over which those rules can be extrapolated. (C) 2001 Elsevier Science B.V. All rights reserved.
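A hedged sketch of the k-means step might look as follows: within a single soil map unit, attributes derived from a digital elevation model (elevation and slope are assumed here as hypothetical inputs) are clustered into candidate landscape elements. This shows the general approach only, not the specific procedure or rules developed in the article.

# Clustering DEM-derived terrain attributes into landscape elements
# within one map unit (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def disaggregate_map_unit(elevation, slope, n_elements=3, random_state=0):
    """Cluster DEM-derived terrain attributes into landscape elements."""
    terrain = np.column_stack([elevation.ravel(), slope.ravel()])
    # standardise so elevation and slope contribute comparably
    terrain = (terrain - terrain.mean(axis=0)) / terrain.std(axis=0)
    labels = KMeans(n_clusters=n_elements, n_init=10,
                    random_state=random_state).fit_predict(terrain)
    return labels.reshape(elevation.shape)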