74 results for Document classification, Naive Bayes classifier, Verb-object pairs
in University of Queensland eSpace - Australia
Abstract:
The Tree Augmented Naïve Bayes (TAN) classifier relaxes the sweeping independence assumptions of the Naïve Bayes approach by taking account of conditional probabilities. It does this in a limited sense, by incorporating the conditional probability of each attribute given the class and (at most) one other attribute. The method of boosting has previously proven very effective in improving the performance of Naïve Bayes classifiers, and in this paper we investigate its effectiveness when applied to the TAN classifier.
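Since boosting only requires a weight-sensitive base learner, its effect can be illustrated with standard tools even though TAN itself is rarely available off the shelf. Below is a minimal sketch, assuming scikit-learn, that boosts a plain Gaussian Naive Bayes classifier with AdaBoost as a stand-in for the TAN base learner; the dataset is purely illustrative.

```python
# Minimal sketch: boosting a Naive Bayes classifier with AdaBoost.
# Assumes scikit-learn; plain Gaussian Naive Bayes stands in for TAN,
# which is not part of scikit-learn. Dataset is illustrative only.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

plain_nb = GaussianNB()
boosted_nb = AdaBoostClassifier(GaussianNB(), n_estimators=50, random_state=0)

print("plain NB   :", cross_val_score(plain_nb, X, y, cv=5).mean())
print("boosted NB :", cross_val_score(boosted_nb, X, y, cv=5).mean())
```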
Abstract:
Conventionally, document classification research focuses on improving the learning capabilities of classifiers. Nevertheless, according to our observation, the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features that are used in a representation, the more comprehensively documents are represented. However, if a representation contains too many irrelevant features, the classifier would suffer not only from the curse of high dimensionality, but also from overfitting. To address this problem of the suitability of document representations, we present a classifier-independent approach to measure the effectiveness of document representations. Our approach utilises a labelled document corpus to estimate the distribution of documents in the feature space. By looking at documents in this way, we can clearly identify the contributions made by different features toward document classification. Experiments have been performed to show how the effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.
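The measure used in the paper is not spelled out in this abstract; as a hypothetical stand-in, a per-feature score such as mutual information over a labelled corpus conveys the flavour of ranking features by their contribution to class separation. A small self-contained sketch, assuming scikit-learn:

```python
# Illustrative only: per-feature mutual information over a tiny labelled corpus
# as a stand-in for a classifier-independent feature-contribution measure.
# Assumes scikit-learn; the corpus and labels are toy examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

docs = [
    "the cat sat on the mat",          # class 0: pets
    "dogs and cats make good pets",    # class 0: pets
    "the rocket reached low orbit",    # class 1: space
    "satellites orbit the planet",     # class 1: space
]
labels = [0, 0, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(docs)
scores = mutual_info_classif(X, labels, discrete_features=True, random_state=0)

for term, score in sorted(zip(vec.get_feature_names_out(), scores),
                          key=lambda t: -t[1])[:10]:
    print(f"{term:12s} {score:.3f}")
```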
Abstract:
Document classification is a supervised machine learning process in which predefined category labels are assigned to documents based on a hypothesis derived from a training set of labelled documents. Documents cannot be directly interpreted by a computer system unless they have been modelled as a collection of computable features. Rogati and Yang [M. Rogati and Y. Yang, Resource selection for domain-specific cross-lingual IR, in SIGIR 2004: Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval, ACM Press, Sheffield, United Kingdom, pp. 154-161] pointed out that the effectiveness of a document classification system may vary across domains. This implies that the quality of the document model contributes to the effectiveness of document classification. Conventionally, model evaluation is accomplished by comparing the effectiveness scores of classifiers on candidate models. However, this kind of evaluation method may encounter either under-fitting or over-fitting problems, because the effectiveness scores are restricted by the learning capacities of the classifiers. We propose a model fitness evaluation method to determine whether a model is sufficient to distinguish positive and negative instances while still competent to provide satisfactory effectiveness with a small feature subset. Our experiments demonstrate how the fitness of models is assessed. The results of our work contribute to research on feature selection, dimensionality reduction and document classification.
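For contrast, the conventional classifier-dependent evaluation that this abstract argues against can be sketched as scoring one classifier on document models built from feature subsets of increasing size; the paper's own fitness measure is not reproduced here. A rough sketch, assuming scikit-learn and synthetic data:

```python
# Sketch of the conventional, classifier-dependent model evaluation: score a
# single classifier on feature subsets of increasing size. Assumes scikit-learn;
# the synthetic data and subset sizes are illustrative only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=200, n_informative=20,
                           random_state=0)

for k in (5, 10, 20, 50, 100, 200):
    model = make_pipeline(SelectKBest(f_classif, k=k),
                          LogisticRegression(max_iter=1000))
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{k:3d} features: {score:.3f}")
```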
Abstract:
In this article, we propose a framework, namely Prediction-Learning-Distillation (PLD), for interactive document classification and distilling misclassified documents. Whenever a user points out misclassified documents, the PLD learns from the mistakes and identifies the same mistakes among all other classified documents. The PLD then enforces this learning for future classifications. If the classifier fails to accept relevant documents or reject irrelevant documents in certain categories, PLD assigns those documents as new positive/negative training instances. The classifier can then address its weaknesses by learning from these new training instances. Our experimental results demonstrate that the proposed algorithm can learn from user-identified misclassified documents and then distil the rest successfully.
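The abstract describes a correct-and-retrain loop; a minimal sketch of that loop, with hypothetical names and data and assuming scikit-learn, is given below. The actual prediction, learning and distillation steps of PLD are not reproduced.

```python
# Minimal sketch of an interactive correct-and-retrain loop in the spirit of
# PLD. Assumes scikit-learn; documents, labels and names are hypothetical.
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "meeting agenda attached",
        "limited offer buy today", "project report draft"]
labels = np.array([1, 0, 1, 0])            # 1 = irrelevant, 0 = relevant (toy)

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)

# Prediction: the user flags a misclassified document and supplies its label.
flagged = ["buy now while stocks last"]
corrected = np.array([1])

# Learning: fold the correction back into the training set and refit.
X_all = sp.vstack([X, vec.transform(flagged)])
y_all = np.concatenate([labels, corrected])
clf.fit(X_all, y_all)

# Distillation: re-score already-classified documents to surface the same
# kind of mistake elsewhere in the collection.
pool = ["buy this now", "weekly status report"]
print(dict(zip(pool, clf.predict(vec.transform(pool)))))
```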
Abstract:
The Edinburgh-Cape Blue Object Survey is a major survey to discover blue stellar objects brighter than B ≈ 18 in the southern sky. It is planned to cover an area of sky of 10 000 deg² with |b| > 30° and δ < 0°. The blue stellar objects are selected by automatic techniques from U and B pairs of UK Schmidt Telescope plates scanned with the COSMOS measuring machine. Follow-up photometry and spectroscopy are being obtained with the SAAO telescopes to classify objects brighter than B = 16.5. This paper describes the survey, the techniques used to extract the blue stellar objects, the photometric methods and accuracy, the spectroscopic classification, and the limits and completeness of the survey.
Abstract:
Merkel cell carcinoma (MCC) is a rare aggressive skin tumor which shares histopathological and genetic features with small-cell lung carcinoma (SCLC); both are of neuroendocrine origin. Comparable to SCLC, MCC cell lines are classified into two different biochemical subgroups designated 'Classic' and 'Variant'. With the aim of identifying typical gene-expression signatures associated with these phenotypically different MCC cell line subgroups, and of searching for differentially expressed genes between MCC and SCLC, we used cDNA arrays to profile 10 MCC cell lines and four SCLC cell lines. Using significance analysis of microarrays, we defined a set of 76 differentially expressed genes that allowed unequivocal identification of the Classic and Variant MCC subgroups. We assume that the differential expression levels of some of these genes reflect, analogous to SCLC, the different biological and clinical properties of the Classic and Variant MCC phenotypes. Therefore, they may serve as useful prognostic markers and potential targets for the development of new therapeutic interventions specific to each subgroup. Moreover, our analysis identified 17 powerful classifier genes capable of discriminating MCC from SCLC. Real-time quantitative RT-PCR analysis of these genes on 26 additional MCC and SCLC samples confirmed their diagnostic classification potential, opening opportunities for new investigations into these aggressive cancers.
Abstract:
Racing algorithms have recently been proposed as a general-purpose method for performing model selection in machine learning algorithms. In this paper, we present an empirical study of the Hoeffding racing algorithm for selecting the k parameter in a simple k-nearest neighbor classifier. Fifteen widely used classification datasets from UCI are used and experiments are conducted across different confidence levels for racing. The results reveal a significant amount of sensitivity of the k-NN classifier to its model parameter value. The Hoeffding racing algorithm also varies widely in its performance, in terms of the computational savings gained over an exhaustive evaluation. While in some cases the savings gained are quite small, the racing algorithm proved to be highly robust against erroneously eliminating the optimal models. All results were strongly dependent on the datasets used.
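The exact racing variant evaluated in the paper is not given in this abstract; the sketch below, assuming numpy and scikit-learn, shows the general idea: incrementally estimate each candidate k's leave-one-out error and drop candidates whose Hoeffding lower bound already exceeds the best candidate's upper bound.

```python
# Sketch of Hoeffding racing over candidate k values for a k-NN classifier.
# Assumes numpy and scikit-learn; dataset, delta and candidates are illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
order = np.random.default_rng(0).permutation(len(y))    # evaluation order

candidates = {k: {"errors": 0, "n": 0} for k in (1, 3, 5, 7, 9, 15, 25)}
delta, value_range = 0.05, 1.0    # confidence parameter, range of a 0/1 error

for i in order:
    mask = np.arange(len(y)) != i                        # leave-one-out split
    for k, stats in candidates.items():
        knn = KNeighborsClassifier(n_neighbors=k).fit(X[mask], y[mask])
        stats["errors"] += int(knn.predict(X[[i]])[0] != y[i])
        stats["n"] += 1
    n = candidates[next(iter(candidates))]["n"]
    eps = value_range * np.sqrt(np.log(2.0 / delta) / (2.0 * n))  # Hoeffding bound
    means = {k: s["errors"] / s["n"] for k, s in candidates.items()}
    best_upper = min(means.values()) + eps
    # Drop candidates whose lower bound is already above the best upper bound.
    candidates = {k: s for k, s in candidates.items() if means[k] - eps <= best_upper}
    if len(candidates) == 1:
        break

print("surviving k values:", sorted(candidates))
```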
Abstract:
Support vector machines (SVMs) have recently emerged as a powerful technique for solving problems in pattern classification and regression. Best performance is obtained from the SVM when its parameters have their values optimally set. In practice, good parameter settings are usually obtained by a lengthy process of trial and error. This paper describes the use of a genetic algorithm to evolve these parameter settings for an application in mobile robotics.
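A compact sketch of this idea, assuming numpy and scikit-learn, is shown below: a small genetic algorithm evolves (log10 C, log10 gamma) for an RBF-kernel SVM with cross-validated accuracy as the fitness function. The dataset and GA settings are illustrative stand-ins, not those from the robotics application.

```python
# Sketch of a genetic algorithm evolving (log10 C, log10 gamma) for an RBF SVM,
# using cross-validated accuracy as fitness. Assumes numpy and scikit-learn;
# the dataset and GA settings are illustrative only.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)

def fitness(genome):
    log_c, log_gamma = genome
    model = make_pipeline(StandardScaler(),
                          SVC(C=10.0 ** log_c, gamma=10.0 ** log_gamma))
    return cross_val_score(model, X, y, cv=3).mean()

pop = rng.uniform(low=(-2.0, -4.0), high=(3.0, 1.0), size=(12, 2))  # initial population
for generation in range(10):
    scores = np.array([fitness(g) for g in pop])
    parents = pop[np.argsort(scores)[-6:]]                # truncation selection
    children = []
    while len(parents) + len(children) < len(pop):
        a, b = parents[rng.integers(len(parents), size=2)]
        child = np.where(rng.random(2) < 0.5, a, b)       # uniform crossover
        children.append(child + rng.normal(0.0, 0.2, 2))  # Gaussian mutation
    pop = np.vstack([parents, children])

best = max(pop, key=fitness)
print("best C = %.3g, gamma = %.3g" % (10.0 ** best[0], 10.0 ** best[1]))
```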
Abstract:
The Java programming language supports concurrency. Concurrent programs are hard to test due to their inherent non-determinism. This paper presents a classification of concurrency failures that is based on a model of Java concurrency. The model and failure classification are used to justify coverage of the synchronization primitives of concurrent components. This is achieved by constructing concurrency flow graphs for each method call. A producer-consumer monitor is used to demonstrate how the approach can be used to measure coverage of concurrency primitives and thereby assist in determining test sequences for deterministic execution.
Abstract:
Results of two experiments are reported that examined how people respond to rectangular targets of different sizes in simple hitting tasks. If a target moves in a straight line and a person is constrained to move along a linear track oriented perpendicular to the target's motion, then the length of the target along its direction of motion constrains the temporal accuracy and precision required to make the interception. The dimensions of the target perpendicular to its direction of motion place no constraints on performance in such a task. In contrast, if the person is not constrained to move along a straight track, the target's dimensions may constrain the spatial as well as the temporal accuracy and precision. The experiments reported here examined how people responded to targets of different vertical extent (height): the task was to strike targets that moved along a straight, horizontal path. In experiment 1, participants were constrained to move along a horizontal linear track to strike targets, and so target height did not constrain performance. Target height, length and speed were co-varied. Movement time (MT) was unaffected by target height but was systematically affected by length (briefer movements to smaller targets) and speed (briefer movements to faster targets). Peak movement speed (Vmax) was influenced by all three independent variables: participants struck shorter, narrower and faster targets harder. In experiment 2, participants were constrained to move in a vertical plane normal to the target's direction of motion. In this task, target height constrains the spatial accuracy required to contact the target. Three groups of eight participants struck targets of different height but of constant length and speed, hence constant temporal accuracy demand (different for each group; one group struck stationary targets, i.e., no temporal accuracy demand). On average, participants showed little or no systematic response to changes in spatial accuracy demand on any dependent measure (MT, Vmax, spatial variable error). The results are interpreted in relation to previous results on movements aimed at stationary targets in the absence of visual feedback.
Abstract:
Developing a unified classification system to replace four of the systems currently used in disability athletics (i.e., track and field) has been widely advocated. The diverse impairments to be included in a unified system require several assessment methods, the results of which cannot be meaningfully compared. Therefore, the taxonomic basis of current classification systems is invalid in a unified system. Biomechanical analysis establishes that force, a vector described in terms of magnitude and direction, is a key determinant of success in all athletic disciplines. It is posited that all impairments to be included in a unified system may be classified as either force magnitude impairments (FMI) or force control impairments (FCI). This framework would provide a valid taxonomic basis for a unified system, creating the opportunity to decrease the number of classes and enhance the viability of disability athletics.
Abstract:
While a number of studies have shown that object-extracted relative clauses are more difficult to understand than subject-extracted counterparts for second language (L2) English learners (e.g., Izumi, 2003), less is known about why this is the case and how they process these complex sentences. This exploratory study examines the potential applicability of Gibson's (1998, 2000) Syntactic Prediction Locality Theory (SPLT), a theory proposed to predict first language (L1) processing difficulty, to L2 processing and considers whether the theory might also account for the processing difficulties of subject- and object-extracted relative clauses encountered by L2 learners. Results of a self-paced reading time experiment from 15 Japanese learners of English are mainly consistent with the reading time profile predicted by the SPLT and thus suggest that the L1 processing theory might also be able to account for L2 processing difficulty.
Abstract:
Data mining is the process of identifying valid, implicit, previously unknown, potentially useful and understandable information from large databases. It is an important step in the process of knowledge discovery in databases (Olaru & Wehenkel, 1999). In a data mining process, input data can be structured, semi-structured, or unstructured. Data can be in text, categorical or numerical values. One of the important characteristics of data mining is its ability to deal with data that are large in volume, distributed, time-variant, noisy, and of high dimensionality. A large number of data mining algorithms have been developed for different applications. For example, association rule mining can be useful for market basket problems, clustering algorithms can be used to discover trends in unsupervised learning problems, classification algorithms can be applied in decision-making problems, and sequential and time series mining algorithms can be used in predicting events, fault detection, and other supervised learning problems (Vapnik, 1999). Classification is among the most important tasks in data mining, particularly for data mining applications in engineering fields. Together with regression, classification is mainly used for predictive modelling. So far, there have been a number of classification algorithms in practice. According to Sebastiani (2002), the main classification algorithms can be categorized as: decision tree and rule-based approaches such as C4.5 (Quinlan, 1996); probabilistic methods such as the Bayesian classifier (Lewis, 1998); on-line methods such as Winnow (Littlestone, 1988) and CVFDT (Hulten, 2001); neural network methods (Rumelhart, Hinton & Williams, 1986); example-based methods such as k-nearest neighbors (Duda & Hart, 1973); and SVM (Cortes & Vapnik, 1995). Other important techniques for classification tasks include Associative Classification (Liu et al., 1998) and Ensemble Classification (Tumer, 1996).
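As a brief illustration of a few of the classifier families listed above, the sketch below compares a decision tree, Naive Bayes, k-nearest neighbors and an SVM using cross-validation, assuming scikit-learn; the dataset is an illustrative stand-in.

```python
# Short illustration of several classifier families named in the abstract,
# compared with 5-fold cross-validation. Assumes scikit-learn; the digits
# dataset is used purely as an example.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM (RBF)": SVC(),
}
for name, model in models.items():
    print(f"{name:14s} {cross_val_score(model, X, y, cv=5).mean():.3f}")
```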