11 resultados para internet traffic classification machine learning apache spark hadoop big data word2vec
em Biblioteca Digital da Produção Intelectual da Universidade de São Paulo (BDPI/USP)
Resumo:
Species` potential distribution modelling consists of building a representation of the fundamental ecological requirements of a species from biotic and abiotic conditions where the species is known to occur. Such models can be valuable tools to understand the biogeography of species and to support the prediction of its presence/absence considering a particular environment scenario. This paper investigates the use of different supervised machine learning techniques to model the potential distribution of 35 plant species from Latin America. Each technique was able to extract a different representation of the relations between the environmental conditions and the distribution profile of the species. The experimental results highlight the good performance of random trees classifiers, indicating this particular technique as a promising candidate for modelling species` potential distribution. (C) 2010 Elsevier Ltd. All rights reserved.
Resumo:
Model trees are a particular case of decision trees employed to solve regression problems. They have the advantage of presenting an interpretable output, helping the end-user to get more confidence in the prediction and providing the basis for the end-user to have new insight about the data, confirming or rejecting hypotheses previously formed. Moreover, model trees present an acceptable level of predictive performance in comparison to most techniques used for solving regression problems. Since generating the optimal model tree is an NP-Complete problem, traditional model tree induction algorithms make use of a greedy top-down divide-and-conquer strategy, which may not converge to the global optimal solution. In this paper, we propose a novel algorithm based on the use of the evolutionary algorithms paradigm as an alternate heuristic to generate model trees in order to improve the convergence to globally near-optimal solutions. We call our new approach evolutionary model tree induction (E-Motion). We test its predictive performance using public UCI data sets, and we compare the results to traditional greedy regression/model trees induction algorithms, as well as to other evolutionary approaches. Results show that our method presents a good trade-off between predictive performance and model comprehensibility, which may be crucial in many machine learning applications. (C) 2010 Elsevier Inc. All rights reserved.
Resumo:
Predictive performance evaluation is a fundamental issue in design, development, and deployment of classification systems. As predictive performance evaluation is a multidimensional problem, single scalar summaries such as error rate, although quite convenient due to its simplicity, can seldom evaluate all the aspects that a complete and reliable evaluation must consider. Due to this, various graphical performance evaluation methods are increasingly drawing the attention of machine learning, data mining, and pattern recognition communities. The main advantage of these types of methods resides in their ability to depict the trade-offs between evaluation aspects in a multidimensional space rather than reducing these aspects to an arbitrarily chosen (and often biased) single scalar measure. Furthermore, to appropriately select a suitable graphical method for a given task, it is crucial to identify its strengths and weaknesses. This paper surveys various graphical methods often used for predictive performance evaluation. By presenting these methods in the same framework, we hope this paper may shed some light on deciding which methods are more suitable to use in different situations.
Resumo:
Various popular machine learning techniques, like support vector machines, are originally conceived for the solution of two-class (binary) classification problems. However, a large number of real problems present more than two classes. A common approach to generalize binary learning techniques to solve problems with more than two classes, also known as multiclass classification problems, consists of hierarchically decomposing the multiclass problem into multiple binary sub-problems, whose outputs are combined to define the predicted class. This strategy results in a tree of binary classifiers, where each internal node corresponds to a binary classifier distinguishing two groups of classes and the leaf nodes correspond to the problem classes. This paper investigates how measures of the separability between classes can be employed in the construction of binary-tree-based multiclass classifiers, adapting the decompositions performed to each particular multiclass problem. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
Several real problems involve the classification of data into categories or classes. Given a data set containing data whose classes are known, Machine Learning algorithms can be employed for the induction of a classifier able to predict the class of new data from the same domain, performing the desired discrimination. Some learning techniques are originally conceived for the solution of problems with only two classes, also named binary classification problems. However, many problems require the discrimination of examples into more than two categories or classes. This paper presents a survey on the main strategies for the generalization of binary classifiers to problems with more than two classes, known as multiclass classification problems. The focus is on strategies that decompose the original multiclass problem into multiple binary subtasks, whose outputs are combined to obtain the final prediction.
Resumo:
Support vector machines (SVMs) were originally formulated for the solution of binary classification problems. In multiclass problems, a decomposition approach is often employed, in which the multiclass problem is divided into multiple binary subproblems, whose results are combined. Generally, the performance of SVM classifiers is affected by the selection of values for their parameters. This paper investigates the use of genetic algorithms (GAs) to tune the parameters of the binary SVMs in common multiclass decompositions. The developed GA may search for a set of parameter values common to all binary classifiers or for differentiated values for each binary classifier. (C) 2008 Elsevier B.V. All rights reserved.
Resumo:
Several popular Machine Learning techniques are originally designed for the solution of two-class problems. However, several classification problems have more than two classes. One approach to deal with multiclass problems using binary classifiers is to decompose the multiclass problem into multiple binary sub-problems disposed in a binary tree. This approach requires a binary partition of the classes for each node of the tree, which defines the tree structure. This paper presents two algorithms to determine the tree structure taking into account information collected from the used dataset. This approach allows the tree structure to be determined automatically for any multiclass dataset.
Resumo:
There is an increasing interest in the application of Evolutionary Algorithms (EAs) to induce classification rules. This hybrid approach can benefit areas where classical methods for rule induction have not been very successful. One example is the induction of classification rules in imbalanced domains. Imbalanced data occur when one or more classes heavily outnumber other classes. Frequently, classical machine learning (ML) classifiers are not able to learn in the presence of imbalanced data sets, inducing classification models that always predict the most numerous classes. In this work, we propose a novel hybrid approach to deal with this problem. We create several balanced data sets with all minority class cases and a random sample of majority class cases. These balanced data sets are fed to classical ML systems that produce rule sets. The rule sets are combined creating a pool of rules and an EA is used to build a classifier from this pool of rules. This hybrid approach has some advantages over undersampling, since it reduces the amount of discarded information, and some advantages over oversampling, since it avoids overfitting. The proposed approach was experimentally analysed and the experimental results show an improvement in the classification performance measured as the area under the receiver operating characteristics (ROC) curve.
Resumo:
Robotic mapping is the process of automatically constructing an environment representation using mobile robots. We address the problem of semantic mapping, which consists of using mobile robots to create maps that represent not only metric occupancy but also other properties of the environment. Specifically, we develop techniques to build maps that represent activity and navigability of the environment. Our approach to semantic mapping is to combine machine learning techniques with standard mapping algorithms. Supervised learning methods are used to automatically associate properties of space to the desired classification patterns. We present two methods, the first based on hidden Markov models and the second on support vector machines. Both approaches have been tested and experimentally validated in two problem domains: terrain mapping and activity-based mapping.
Resumo:
Establishing metrics to assess machine translation (MT) systems automatically is now crucial owing to the widespread use of MT over the web. In this study we show that such evaluation can be done by modeling text as complex networks. Specifically, we extend our previous work by employing additional metrics of complex networks, whose results were used as input for machine learning methods and allowed MT texts of distinct qualities to be distinguished. Also shown is that the node-to-node mapping between source and target texts (English-Portuguese and Spanish-Portuguese pairs) can be improved by adding further hierarchical levels for the metrics out-degree, in-degree, hierarchical common degree, cluster coefficient, inter-ring degree, intra-ring degree and convergence ratio. The results presented here amount to a proof-of-principle that the possible capturing of a wider context with the hierarchical levels may be combined with machine learning methods to yield an approach for assessing the quality of MT systems. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
The design of binary morphological operators that are translation-invariant and locally defined by a finite neighborhood window corresponds to the problem of designing Boolean functions. As in any supervised classification problem, morphological operators designed from a training sample also suffer from overfitting. Large neighborhood tends to lead to performance degradation of the designed operator. This work proposes a multilevel design approach to deal with the issue of designing large neighborhood-based operators. The main idea is inspired by stacked generalization (a multilevel classifier design approach) and consists of, at each training level, combining the outcomes of the previous level operators. The final operator is a multilevel operator that ultimately depends on a larger neighborhood than of the individual operators that have been combined. Experimental results show that two-level operators obtained by combining operators designed on subwindows of a large window consistently outperform the single-level operators designed on the full window. They also show that iterating two-level operators is an effective multilevel approach to obtain better results.