906 resultados para decision tree


Relevância:

60.00% 60.00%

Publicador:

Resumo:

The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers which are affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism also does not scale well on large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism. Parallel Random Prism is based on the MapReduce programming paradigm. The paper provides, for the first time, novel theoretical analysis of the proposed technique and in-depth experimental study that show that Parallel Random Prism scales well on a large number of training examples, a large number of data features and a large number of processors. Expressiveness of decision rules that our technique produces makes it a natural choice for Big Data applications where informed decision making increases the user’s trust in the system.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The substitution of missing values, also called imputation, is an important data preparation task for many domains. Ideally, the substitution of missing values should not insert biases into the dataset. This aspect has been usually assessed by some measures of the prediction capability of imputation methods. Such measures assume the simulation of missing entries for some attributes whose values are actually known. These artificially missing values are imputed and then compared with the original values. Although this evaluation is useful, it does not allow the influence of imputed values in the ultimate modelling task (e.g. in classification) to be inferred. We argue that imputation cannot be properly evaluated apart from the modelling task. Thus, alternative approaches are needed. This article elaborates on the influence of imputed values in classification. In particular, a practical procedure for estimating the inserted bias is described. As an additional contribution, we have used such a procedure to empirically illustrate the performance of three imputation methods (majority, naive Bayes and Bayesian networks) in three datasets. Three classifiers (decision tree, naive Bayes and nearest neighbours) have been used as modelling tools in our experiments. The achieved results illustrate a variety of situations that can take place in the data preparation practice.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The main purpose of this thesis project is to prediction of symptom severity and cause in data from test battery of the Parkinson’s disease patient, which is based on data mining. The collection of the data is from test battery on a hand in computer. We use the Chi-Square method and check which variables are important and which are not important. Then we apply different data mining techniques on our normalize data and check which technique or method gives good results.The implementation of this thesis is in WEKA. We normalize our data and then apply different methods on this data. The methods which we used are Naïve Bayes, CART and KNN. We draw the Bland Altman and Spearman’s Correlation for checking the final results and prediction of data. The Bland Altman tells how the percentage of our confident level in this data is correct and Spearman’s Correlation tells us our relationship is strong. On the basis of results and analysis we see all three methods give nearly same results. But if we see our CART (J48 Decision Tree) it gives good result of under predicted and over predicted values that’s lies between -2 to +2. The correlation between the Actual and Predicted values is 0,794in CART. Cause gives the better percentage classification result then disability because it can use two classes.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

A programming style can be seen as a particular model of shaping thought or a special way of codifying language to solve a problem. An adaptive device is made up of an underlying formalism, for instance, an automaton, a grammar, a decision tree, etc., and an adaptive mechanism, responsible for providing features for self-modification. Adaptive languages are obtained by using some programming language as the device’s underlying formalism. The conception of such languages calls for a new programming style, since the application of adaptive technology in the field of programming languages suggests a new way of thinking. Adaptive languages have the basic feature of allowing the expression of programs which self-modifying through adaptive actions at runtime. With the adaptive style, programming language codes can be structured in such a way that the codified program therein modifies or adapts itself towards the needs of the problem. The adaptive programming style may be a feasible alternate way to obtain self-modifying consistent codes, which allow its use in modern applications for self-modifying code.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

An adaptive device is made up of an underlying mechanism, for instance, an automaton, a grammar, a decision tree, etc., to which is added an adaptive mechanism, responsible for allowing a dynamic modification in the structure of the underlying mechanism. This article aims to investigate if a programming language can be used as an underlying mechanism of an adaptive device, resulting in an adaptive language.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Layer mortality due to heat stress is an important economic loss for the producer. The aim of this study was to determine the mortality pattern of layers reared in the region of Bastos, SP, Brazil, according to external environment and bird age. Data mining technique were used based on monthly mortality records of hens in production, 135 poultry houses, from January 2004 to August 2008. The external environment was characterized according maximum and minimum temperatures, obtained monthly at the meteorological station CATI in the city of Tupa, SP, Brazil. Mortality was classified as normal (<= 1.2%) or high (> 1.2%), considering the mortality limits mentioned in literature. Data mining technique produced a decision tree with nine levels and 23 leaves, with 62.6% of overall accuracy. The hit rate for the High class was 64.1% and 59.9% for Normal class. The decision tree allowed finding a pattern in the mortality data, generating a model for estimating mortality based on the thermal environment and bird age.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Foliar diagnosis is a method for assessing the nutritional status of agricultural crops, which helps in the understanding of soil fertility and rationalized application of fertilizers taking into account economic and environmental criteria. The study aimed to use the landrelief as criteria to assist in interpreting the spatial variability of nutrient content of the citrus leaf. The leaves were collected at regular intervals of 50 m, totaling 332 sampling points. Data were analyzed by descriptive statistics, geostatistics and induction of decision tree. With the aid of digital elevation model (MDE) and the profile planaltimetric, the area was divided into three different landrelief and sub-strands. The highest values for nutrients from the leaves of citrus were observed at the top (concave area) segments on a half-slope and lower slope. The nutrients from the citrus leaves showed high values of correlation (above 0.5) with the altitude of the study area. The technique of geostatistics and the induction of decision tree show that the relief is the variable with the greatest potential to interpret the maps of spatial variability of nutrients from the citrus leaves.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The industries are getting more and more rigorous, when security is in question, no matter is to avoid financial damages due to accidents and low productivity, or when it s related to the environment protection. It was thinking about great world accidents around the world involving aircrafts and industrial process (nuclear, petrochemical and so on) that we decided to invest in systems that could detect fault and diagnosis (FDD) them. The FDD systems can avoid eventual fault helping man on the maintenance and exchange of defective equipments. Nowadays, the issues that involve detection, isolation, diagnose and the controlling of tolerance fault are gathering strength in the academic and industrial environment. It is based on this fact, in this work, we discuss the importance of techniques that can assist in the development of systems for Fault Detection and Diagnosis (FDD) and propose a hybrid method for FDD in dynamic systems. We present a brief history to contextualize the techniques used in working environments. The detection of fault in the proposed system is based on state observers in conjunction with other statistical techniques. The principal idea is to use the observer himself, in addition to serving as an analytical redundancy, in allowing the creation of a residue. This residue is used in FDD. A signature database assists in the identification of system faults, which based on the signatures derived from trend analysis of the residue signal and its difference, performs the classification of the faults based purely on a decision tree. This FDD system is tested and validated in two plants: a simulated plant with coupled tanks and didactic plant with industrial instrumentation. All collected results of those tests will be discussed

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Nowadays, classifying proteins in structural classes, which concerns the inference of patterns in their 3D conformation, is one of the most important open problems in Molecular Biology. The main reason for this is that the function of a protein is intrinsically related to its spatial conformation. However, such conformations are very difficult to be obtained experimentally in laboratory. Thus, this problem has drawn the attention of many researchers in Bioinformatics. Considering the great difference between the number of protein sequences already known and the number of three-dimensional structures determined experimentally, the demand of automated techniques for structural classification of proteins is very high. In this context, computational tools, especially Machine Learning (ML) techniques, have become essential to deal with this problem. In this work, ML techniques are used in the recognition of protein structural classes: Decision Trees, k-Nearest Neighbor, Naive Bayes, Support Vector Machine and Neural Networks. These methods have been chosen because they represent different paradigms of learning and have been widely used in the Bioinfornmatics literature. Aiming to obtain an improvment in the performance of these techniques (individual classifiers), homogeneous (Bagging and Boosting) and heterogeneous (Voting, Stacking and StackingC) multiclassification systems are used. Moreover, since the protein database used in this work presents the problem of imbalanced classes, artificial techniques for class balance (Undersampling Random, Tomek Links, CNN, NCL and OSS) are used to minimize such a problem. In order to evaluate the ML methods, a cross-validation procedure is applied, where the accuracy of the classifiers is measured using the mean of classification error rate, on independent test sets. These means are compared, two by two, by the hypothesis test aiming to evaluate if there is, statistically, a significant difference between them. With respect to the results obtained with the individual classifiers, Support Vector Machine presented the best accuracy. In terms of the multi-classification systems (homogeneous and heterogeneous), they showed, in general, a superior or similar performance when compared to the one achieved by the individual classifiers used - especially Boosting with Decision Tree and the StackingC with Linear Regression as meta classifier. The Voting method, despite of its simplicity, has shown to be adequate for solving the problem presented in this work. The techniques for class balance, on the other hand, have not produced a significant improvement in the global classification error. Nevertheless, the use of such techniques did improve the classification error for the minority class. In this context, the NCL technique has shown to be more appropriated

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Nitroaromatic compounds such as nifuroxazide are used in many human enteropathogenic bacteria infections without causing an increase in the plasmidial antibiotic resistance of the aerobic Gram-negative intestinal Enterobacteriaceae. For these reasons, these compounds have been synthesized using the rational approach of Topliss' decision tree. Generally. this approach allows us to obtain the most active derivative from the series in a few steps. These compounds were tested against Mycobacterium tuberculosis in vitro and the most active of the series identified. A new lead for potential tuberculostatic activity has been predicted and will be used in further QSAR studies. (C) 2002 Elsevier B.V. Ltd. All rights reserved.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The objective of the researches in artificial intelligence is to qualify the computer to execute functions that are performed by humans using knowledge and reasoning. This work was developed in the area of machine learning, that it s the study branch of artificial intelligence, being related to the project and development of algorithms and techniques capable to allow the computational learning. The objective of this work is analyzing a feature selection method for ensemble systems. The proposed method is inserted into the filter approach of feature selection method, it s using the variance and Spearman correlation to rank the feature and using the reward and punishment strategies to measure the feature importance for the identification of the classes. For each ensemble, several different configuration were used, which varied from hybrid (homogeneous) to non-hybrid (heterogeneous) structures of ensemble. They were submitted to five combining methods (voting, sum, sum weight, multiLayer Perceptron and naïve Bayes) which were applied in six distinct database (real and artificial). The classifiers applied during the experiments were k- nearest neighbor, multiLayer Perceptron, naïve Bayes and decision tree. Finally, the performance of ensemble was analyzed comparatively, using none feature selection method, using a filter approach (original) feature selection method and the proposed method. To do this comparison, a statistical test was applied, which demonstrate that there was a significant improvement in the precision of the ensembles

Relevância:

60.00% 60.00%

Publicador:

Resumo:

CONTEXTO E OBJETIVO: Gestações complicadas pelo diabetes estão associadas com aumento de complicações maternas e neonatais. Os custos hospitalares aumentam de acordo com a assistência prestada. O objetivo foi calcular o custo-benefício e a taxa de rentabilidade social da hospitalização comparada ao atendimento ambulatorial em gestantes com diabetes ou com hiperglicemia leve. DESENHO do ESTUDO: Estudo prospectivo, observacional, quantitativo, realizado em hospital universitário, sendo incluídas todas as gestantes com diabetes pregestacional e gestacional ou com hiperglicemia leve que não desenvolveram intercorrências clínicas na gestação e que tiveram parto no Hospital das Clínicas, Faculdade de Medicina de Botucatu, Universidade Estadual Paulista (HC-FMB-Unesp). MÉTODOS: Trinta gestantes tratadas com dieta foram acompanhadas em ambulatório e 20 tratadas com dieta e insulina foram abordadas com hospitalizações curtas e frequentes. Foram obtidos custos diretos (pessoal, material e exames) e indiretos (despesas gerais) a partir de dados contidos no prontuário e no sistema de custo por absorção do hospital e posteriormente calculado o custo-benefício. RESULTADOS: O sucesso do tratamento das gestantes diabéticas evitou o gasto de US$ 1.517,97 e US$ 1.127,43 para pacientes hospitalizadas e ambulatoriais, respectivamente. O custo-benefício da atenção hospitalizada foi US$ 143.719,16 e ambulatorial, US$ 253.267,22, com rentabilidade social 1,87 e 5,35 respectivamente. CONCLUSÃO: A análise árvore de decisão confirma que o sucesso dos tratamentos elimina custos no hospital. A relação custo-benefício indicou que o tratamento ambulatorial é economicamente mais vantajoso do que a hospitalização. A rentabilidade social de ambos os tratamentos foi maior que 1, indicando que ambos os tipos de atendimento à gestante diabética têm benefício positivo.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The identification of genes essential for survival is important for the understanding of the minimal requirements for cellular life and for drug design. As experimental studies with the purpose of building a catalog of essential genes for a given organism are time-consuming and laborious, a computational approach which could predict gene essentiality with high accuracy would be of great value. We present here a novel computational approach, called NTPGE (Network Topology-based Prediction of Gene Essentiality), that relies on the network topology features of a gene to estimate its essentiality. The first step of NTPGE is to construct the integrated molecular network for a given organism comprising protein physical, metabolic and transcriptional regulation interactions. The second step consists in training a decision-tree-based machine-learning algorithm on known essential and non-essential genes of the organism of interest, considering as learning attributes the network topology information for each of these genes. Finally, the decision-tree classifier generated is applied to the set of genes of this organism to estimate essentiality for each gene. We applied the NTPGE approach for discovering the essential genes in Escherichia coli and then assessed its performance. (C) 2007 Elsevier B.V. All rights reserved.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)