86 resultados para decision tree
em Indian Institute of Science - Bangalore - Índia
Resumo:
The minimum cost classifier when general cost functionsare associated with the tasks of feature measurement and classification is formulated as a decision graph which does not reject class labels at intermediate stages. Noting its complexities, a heuristic procedure to simplify this scheme to a binary decision tree is presented. The optimizationof the binary tree in this context is carried out using ynamicprogramming. This technique is applied to the voiced-unvoiced-silence classification in speech processing.
Resumo:
Land cover (LC) changes play a major role in global as well as at regional scale patterns of the climate and biogeochemistry of the Earth system. LC information presents critical insights in understanding of Earth surface phenomena, particularly useful when obtained synoptically from remote sensing data. However, for developing countries and those with large geographical extent, regular LC mapping is prohibitive with data from commercial sensors (high cost factor) of limited spatial coverage (low temporal resolution and band swath). In this context, free MODIS data with good spectro-temporal resolution meet the purpose. LC mapping from these data has continuously evolved with advances in classification algorithms. This paper presents a comparative study of two robust data mining techniques, the multilayer perceptron (MLP) and decision tree (DT) on different products of MODIS data corresponding to Kolar district, Karnataka, India. The MODIS classified images when compared at three different spatial scales (at district level, taluk level and pixel level) shows that MLP based classification on minimum noise fraction components on MODIS 36 bands provide the most accurate LC mapping with 86% accuracy, while DT on MODIS 36 bands principal components leads to less accurate classification (69%).
Resumo:
The design and operation of the minimum cost classifier, where the total cost is the sum of the measurement cost and the classification cost, is computationally complex. Noting the difficulties associated with this approach, decision tree design directly from a set of labelled samples is proposed in this paper. The feature space is first partitioned to transform the problem to one of discrete features. The resulting problem is solved by a dynamic programming algorithm over an explicitly ordered state space of all outcomes of all feature subsets. The solution procedure is very general and is applicable to any minimum cost pattern classification problem in which each feature has a finite number of outcomes. These techniques are applied to (i) voiced, unvoiced, and silence classification of speech, and (ii) spoken vowel recognition. The resulting decision trees are operationally very efficient and yield attractive classification accuracies.
Resumo:
In this paper, we present a new algorithm for learning oblique decision trees. Most of the current decision tree algorithms rely on impurity measures to assess the goodness of hyperplanes at each node while learning a decision tree in top-down fashion. These impurity measures do not properly capture the geometric structures in the data. Motivated by this, our algorithm uses a strategy for assessing the hyperplanes in such a way that the geometric structure in the data is taken into account. At each node of the decision tree, we find the clustering hyperplanes for both the classes and use their angle bisectors as the split rule at that node. We show through empirical studies that this idea leads to small decision trees and better performance. We also present some analysis to show that the angle bisectors of clustering hyperplanes that we use as the split rules at each node are solutions of an interesting optimization problem and hence argue that this is a principled method of learning a decision tree.
Resumo:
In this paper we present a novel algorithm for learning oblique decision trees. Most of the current decision tree algorithms rely on impurity measures to assess goodness of hyperplanes at each node. These impurity measures do not properly capture the geometric structures in the data. Motivated by this, our algorithm uses a strategy, based on some recent variants of SVM, to assess the hyperplanes in such a way that the geometric structure in the data is taken into account. We show through empirical studies that our method is effective.
Resumo:
Design of speaker identification schemes for a small number of speakers (around 10) with a high degree of accuracy in controlled environment is a practical proposition today. When the number of speakers is large (say 50–100), many of these schemes cannot be directly extended, as both recognition error and computation time increase monotonically with population size. The feature selection problem is also complex for such schemes. Though there were earlier attempts to rank order features based on statistical distance measures, it has been observed only recently that the best two independent measurements are not the same as the combination in two's for pattern classification. We propose here a systematic approach to the problem using the decision tree or hierarchical classifier with the following objectives: (1) Design of optimal policy at each node of the tree given the tree structure i.e., the tree skeleton and the features to be used at each node. (2) Determination of the optimal feature measurement and decision policy given only the tree skeleton. Applicability of optimization procedures such as dynamic programming in the design of such trees is studied. The experimental results deal with the design of a 50 speaker identification scheme based on this approach.
Resumo:
In this paper we propose a new algorithm for learning polyhedral classifiers. In contrast to existing methods for learning polyhedral classifier which solve a constrained optimization problem, our method solves an unconstrained optimization problem. Our method is based on a logistic function based model for the posterior probability function. We propose an alternating optimization algorithm, namely, SPLA1 (Single Polyhedral Learning Algorithm1) which maximizes the loglikelihood of the training data to learn the parameters. We also extend our method to make it independent of any user specified parameter (e.g., number of hyperplanes required to form a polyhedral set) in SPLA2. We show the effectiveness of our approach with experiments on various synthetic and real world datasets and compare our approach with a standard decision tree method (OC1) and a constrained optimization based method for learning polyhedral sets.
Resumo:
In data mining, an important goal is to generate an abstraction of the data. Such an abstraction helps in reducing the space and search time requirements of the overall decision making process. Further, it is important that the abstraction is generated from the data with a small number of disk scans. We propose a novel data structure, pattern count tree (PC-tree), that can be built by scanning the database only once. PC-tree is a minimal size complete representation of the data and it can be used to represent dynamic databases with the help of knowledge that is either static or changing. We show that further compactness can be achieved by constructing the PC-tree on segmented patterns. We exploit the flexibility offered by rough sets to realize a rough PC-tree and use it for efficient and effective rough classification. To be consistent with the sizes of the branches of the PC-tree, we use upper and lower approximations of feature sets in a manner different from the conventional rough set theory. We conducted experiments using the proposed classification scheme on a large-scale hand-written digit data set. We use the experimental results to establish the efficacy of the proposed approach. (C) 2002 Elsevier Science B.V. All rights reserved.
Resumo:
Several techniques are known for searching an ordered collection of data. The techniques and analyses of retrieval methods based on primary attributes are straightforward. Retrieval using secondary attributes depends on several factors. For secondary attribute retrieval, the linear structures—inverted lists, multilists, doubly linked lists—and the recently proposed nonlinear tree structures—multiple attribute tree (MAT), K-d tree (kdT)—have their individual merits. It is shown in this paper that, of the two tree structures, MAT possesses several features of a systematic data structure for external file organisation which make it superior to kdT. Analytic estimates for the complexity of node searchers, in MAT and kdT for several types of queries, are developed and compared.
Resumo:
A variety of data structures such as inverted file, multi-lists, quad tree, k-d tree, range tree, polygon tree, quintary tree, multidimensional tries, segment tree, doubly chained tree, the grid file, d-fold tree. super B-tree, Multiple Attribute Tree (MAT), etc. have been studied for multidimensional searching and related problems. Physical data base organization, which is an important application of multidimensional searching, is traditionally and mostly handled by employing inverted file. This study proposes MAT data structure for bibliographic file systems, by illustrating the superiority of MAT data structure over inverted file. Both the methods are compared in terms of preprocessing, storage and query costs. Worst-case complexity analysis of both the methods, for a partial match query, is carried out in two cases: (a) when directory resides in main memory, (b) when directory resides in secondary memory. In both cases, MAT data structure is shown to be more efficient than the inverted file method. Arguments are given to illustrate the superiority of MAT data structure in an average case also. An efficient adaptation of MAT data structure, that exploits the special features of MAT structure and bibliographic files, is proposed for bibliographic file systems. In this adaptation, suitable techniques for fixing and ranking of the attributes for MAT data structure are proposed. Conclusions and proposals for future research are presented.
Resumo:
Abstract-To detect errors in decision tables one needs to decide whether a given set of constraints is feasible or not. This paper describes an algorithm to do so when the constraints are linear in variables that take only integer values. Decision tables with such constraints occur frequently in business data processing and in nonnumeric applications. The aim of the algorithm is to exploit. the abundance of very simple constraints that occur in typical decision table contexts. Essentially, the algorithm is a backtrack procedure where the the solution space is pruned by using the set of simple constrains. After some simplications, the simple constraints are captured in an acyclic directed graph with weighted edges. Further, only those partial vectors are considered from extension which can be extended to assignments that will at least satisfy the simple constraints. This is how pruning of the solution space is achieved. For every partial assignment considered, the graph representation of the simple constraints provides a lower bound for each variable which is not yet assigned a value. These lower bounds play a vital role in the algorithm and they are obtained in an efficient manner by updating older lower bounds. Our present algorithm also incorporates an idea by which it can be checked whether or not an (m - 2)-ary vector can be extended to a solution vector of m components, thereby backtracking is reduced by one component.
Resumo:
We present here the first statistically calibrated and verified tree-ring reconstruction of climate from continental Southeast Asia.The reconstructed variable is March-May (MAM) Palmer Drought Severity Index (PDSI) based on ring widths from 22 trees (42 radial cores) of rare and long-lived conifer, Fokienia hodginsii (Po Mu as locally called) from northern Vietnam. This is the first published tree ring chronology from Vietnam as well as the first for this species. Spanning 535 years, this is the longest cross-dated tree-ring series yet produced from continental Southeast Asia. Response analysis revealed that the annual growth of Fokienia at this site was mostly governed by soil moisture in the pre-monsoon season. The reconstruction passed the calibration-verification tests commonly used in dendroclimatology, and revealed two prominent periods of drought in the mid-eighteenth and late-nineteenth enturies. The former lasted nearly 30 years and was concurrent with a similar drought over northwestern Thailand inferred from teak rings, suggesting a ``mega-drought'' extending across Indochina in the eighteenth century. Both of our reconstructed droughts are consistent with the periods of warm sea surface temperature (SST)anomalies in the tropical Pacific. Spatial correlation analyses with global SST indicated that ENSO-like anomalies might play a role in modulating droughts over the region, with El Nio (warm) phases resulting in reduced rainfall. However, significant correlation was also seen with SST over the Indian Ocean and the north Pacific,suggesting that ENSO is not the only factor affecting the climate of the area. Spectral analyses revealed significant peaks in the range of 53.9-78.8 years as well as in the ENSO-variability range of 2.0 to 3.2 years.
Resumo:
We present a fast algorithm for computing a Gomory-Hu tree or cut tree for an unweighted undirected graph G = (V, E). The expected running time of our algorithm is (O) over tilde (mc) where vertical bar E vertical bar = m and c is the maximum u-v edge connectivity, where u, v is an element of V. When the input graph is also simple (i.e., it has no parallel edges), then the u-v edge connectivity for each pair of vertices u and v is at most n - 1; so the expected run-ning time of our algorithm for simple unweighted graphs is (O) over tilde (mn). All the algorithms currently known for constructing a Gomory-Hu tree [8, 9] use n - 1 minimum s-t cut (i.e., max flow) subroutines. This in conjunction with the current fastest (O) over tilde (n(20/9)) max flow algorithm due to Karger and Levine[11] yields the current best running time of (O) over tilde (n(20/9)n) for Gomory-Hu tree construction on simple unweighted graphs with m edges and n vertices. Thus we present the first (O) over tilde (mn) algorithm for constructing a Gomory-Hu tree for simple unweighted graphs. We do not use a max flow subroutine here; we present an efficient tree packing algorithm for computing Steiner edge connectivity and use this algorithm as our main subroutine. The advantage in using a tree packing algorithm for constructing a Gomory-Hu tree is that the work done in computing a minimum Steiner cut for a Steiner set S subset of V can be reused for computing a minimum Steiner cut for certain Steiner sets S' subset of S.
Resumo:
Hornbills are important dispersers of a wide range of tree species. Many of these species bear fruits with large, lipid-rich seeds that could attract terrestrial rodents. Rodents have multiple effects on seed fates, many of which remain poorly understood in the Palaeotropics. The role of terrestrial rodents was investigated by tracking seed fate of five horn bill-dispersed tree species in a tropical forest in north-cast India. Seeds were marked inside and outside of exclosures below 6-12 parent fruiting trees (undispersed seed rain) and six hornbill nest trees (a post-dispersal site). Rodent visitors and seed removal ere monitored using camera traps. Our findings suggest that several rodent species. especially two species of porcupine were major on-site seed predators. Scatter-hoarding was rare (1.4%). Seeds at hornbill nest trees had lower survival compared with parent fruiting trees, indicating that clumped dispersal by hornbills may not necessarily improve seed survival. Seed survival in the presence and absence of rodents varied with tree species. Some species (e.g. Polyalthia simiarum) showed no difference, others (e.g. Dysoxylum binectariferum) experienced up to a 64%. decrease in survival in the presence of rodents. The differing magnitude of seed predation by rodents can have significant consequences at the seed establishment stage.