907 resultados para optimal feature selection
Resumo:
Feature selection is one of important and frequently used techniques in data preprocessing. It can improve the efficiency and the effectiveness of data mining by reducing the dimensions of feature space and removing the irrelevant and redundant information. Feature selection can be viewed as a global optimization problem of finding a minimum set of M relevant features that describes the dataset as well as the original N attributes. In this paper, we apply the adaptive partitioned random search strategy into our feature selection algorithm. Under this search strategy, the partition structure and evaluation function is proposed for feature selection problem. This algorithm ensures the global optimal solution in theory and avoids complete randomness in search direction. The good property of our algorithm is shown through the theoretical analysis.
Resumo:
Executive Summary The unifying theme of this thesis is the pursuit of a satisfactory ways to quantify the riskureward trade-off in financial economics. First in the context of a general asset pricing model, then across models and finally across country borders. The guiding principle in that pursuit was to seek innovative solutions by combining ideas from different fields in economics and broad scientific research. For example, in the first part of this thesis we sought a fruitful application of strong existence results in utility theory to topics in asset pricing. In the second part we implement an idea from the field of fuzzy set theory to the optimal portfolio selection problem, while the third part of this thesis is to the best of our knowledge, the first empirical application of some general results in asset pricing in incomplete markets to the important topic of measurement of financial integration. While the first two parts of this thesis effectively combine well-known ways to quantify the risk-reward trade-offs the third one can be viewed as an empirical verification of the usefulness of the so-called "good deal bounds" theory in designing risk-sensitive pricing bounds. Chapter 1 develops a discrete-time asset pricing model, based on a novel ordinally equivalent representation of recursive utility. To the best of our knowledge, we are the first to use a member of a novel class of recursive utility generators to construct a representative agent model to address some long-lasting issues in asset pricing. Applying strong representation results allows us to show that the model features countercyclical risk premia, for both consumption and financial risk, together with low and procyclical risk free rate. As the recursive utility used nests as a special case the well-known time-state separable utility, all results nest the corresponding ones from the standard model and thus shed light on its well-known shortcomings. The empirical investigation to support these theoretical results, however, showed that as long as one resorts to econometric methods based on approximating conditional moments with unconditional ones, it is not possible to distinguish the model we propose from the standard one. Chapter 2 is a join work with Sergei Sontchik. There we provide theoretical and empirical motivation for aggregation of performance measures. The main idea is that as it makes sense to apply several performance measures ex-post, it also makes sense to base optimal portfolio selection on ex-ante maximization of as many possible performance measures as desired. We thus offer a concrete algorithm for optimal portfolio selection via ex-ante optimization over different horizons of several risk-return trade-offs simultaneously. An empirical application of that algorithm, using seven popular performance measures, suggests that realized returns feature better distributional characteristics relative to those of realized returns from portfolio strategies optimal with respect to single performance measures. When comparing the distributions of realized returns we used two partial risk-reward orderings first and second order stochastic dominance. We first used the Kolmogorov Smirnov test to determine if the two distributions are indeed different, which combined with a visual inspection allowed us to demonstrate that the way we propose to aggregate performance measures leads to portfolio realized returns that first order stochastically dominate the ones that result from optimization only with respect to, for example, Treynor ratio and Jensen's alpha. We checked for second order stochastic dominance via point wise comparison of the so-called absolute Lorenz curve, or the sequence of expected shortfalls for a range of quantiles. As soon as the plot of the absolute Lorenz curve for the aggregated performance measures was above the one corresponding to each individual measure, we were tempted to conclude that the algorithm we propose leads to portfolio returns distribution that second order stochastically dominates virtually all performance measures considered. Chapter 3 proposes a measure of financial integration, based on recent advances in asset pricing in incomplete markets. Given a base market (a set of traded assets) and an index of another market, we propose to measure financial integration through time by the size of the spread between the pricing bounds of the market index, relative to the base market. The bigger the spread around country index A, viewed from market B, the less integrated markets A and B are. We investigate the presence of structural breaks in the size of the spread for EMU member country indices before and after the introduction of the Euro. We find evidence that both the level and the volatility of our financial integration measure increased after the introduction of the Euro. That counterintuitive result suggests the presence of an inherent weakness in the attempt to measure financial integration independently of economic fundamentals. Nevertheless, the results about the bounds on the risk free rate appear plausible from the view point of existing economic theory about the impact of integration on interest rates.
Resumo:
Personalized medicine will revolutionize our capabilities to combat disease. Working toward this goal, a fundamental task is the deciphering of geneticvariants that are predictive of complex diseases. Modern studies, in the formof genome-wide association studies (GWAS) have afforded researchers with the opportunity to reveal new genotype-phenotype relationships through the extensive scanning of genetic variants. These studies typically contain over half a million genetic features for thousands of individuals. Examining this with methods other than univariate statistics is a challenging task requiring advanced algorithms that are scalable to the genome-wide level. In the future, next-generation sequencing studies (NGS) will contain an even larger number of common and rare variants. Machine learning-based feature selection algorithms have been shown to have the ability to effectively create predictive models for various genotype-phenotype relationships. This work explores the problem of selecting genetic variant subsets that are the most predictive of complex disease phenotypes through various feature selection methodologies, including filter, wrapper and embedded algorithms. The examined machine learning algorithms were demonstrated to not only be effective at predicting the disease phenotypes, but also doing so efficiently through the use of computational shortcuts. While much of the work was able to be run on high-end desktops, some work was further extended so that it could be implemented on parallel computers helping to assure that they will also scale to the NGS data sets. Further, these studies analyzed the relationships between various feature selection methods and demonstrated the need for careful testing when selecting an algorithm. It was shown that there is no universally optimal algorithm for variant selection in GWAS, but rather methodologies need to be selected based on the desired outcome, such as the number of features to be included in the prediction model. It was also demonstrated that without proper model validation, for example using nested cross-validation, the models can result in overly-optimistic prediction accuracies and decreased generalization ability. It is through the implementation and application of machine learning methods that one can extract predictive genotype–phenotype relationships and biological insights from genetic data sets.
Resumo:
The curse of dimensionality is a major problem in the fields of machine learning, data mining and knowledge discovery. Exhaustive search for the most optimal subset of relevant features from a high dimensional dataset is NP hard. Sub–optimal population based stochastic algorithms such as GP and GA are good choices for searching through large search spaces, and are usually more feasible than exhaustive and deterministic search algorithms. On the other hand, population based stochastic algorithms often suffer from premature convergence on mediocre sub–optimal solutions. The Age Layered Population Structure (ALPS) is a novel metaheuristic for overcoming the problem of premature convergence in evolutionary algorithms, and for improving search in the fitness landscape. The ALPS paradigm uses an age–measure to control breeding and competition between individuals in the population. This thesis uses a modification of the ALPS GP strategy called Feature Selection ALPS (FSALPS) for feature subset selection and classification of varied supervised learning tasks. FSALPS uses a novel frequency count system to rank features in the GP population based on evolved feature frequencies. The ranked features are translated into probabilities, which are used to control evolutionary processes such as terminal–symbol selection for the construction of GP trees/sub-trees. The FSALPS metaheuristic continuously refines the feature subset selection process whiles simultaneously evolving efficient classifiers through a non–converging evolutionary process that favors selection of features with high discrimination of class labels. We investigated and compared the performance of canonical GP, ALPS and FSALPS on high–dimensional benchmark classification datasets, including a hyperspectral image. Using Tukey’s HSD ANOVA test at a 95% confidence interval, ALPS and FSALPS dominated canonical GP in evolving smaller but efficient trees with less bloat expressions. FSALPS significantly outperformed canonical GP and ALPS and some reported feature selection strategies in related literature on dimensionality reduction.
Resumo:
The curse of dimensionality is a major problem in the fields of machine learning, data mining and knowledge discovery. Exhaustive search for the most optimal subset of relevant features from a high dimensional dataset is NP hard. Sub–optimal population based stochastic algorithms such as GP and GA are good choices for searching through large search spaces, and are usually more feasible than exhaustive and determinis- tic search algorithms. On the other hand, population based stochastic algorithms often suffer from premature convergence on mediocre sub–optimal solutions. The Age Layered Population Structure (ALPS) is a novel meta–heuristic for overcoming the problem of premature convergence in evolutionary algorithms, and for improving search in the fitness landscape. The ALPS paradigm uses an age–measure to control breeding and competition between individuals in the population. This thesis uses a modification of the ALPS GP strategy called Feature Selection ALPS (FSALPS) for feature subset selection and classification of varied supervised learning tasks. FSALPS uses a novel frequency count system to rank features in the GP population based on evolved feature frequencies. The ranked features are translated into probabilities, which are used to control evolutionary processes such as terminal–symbol selection for the construction of GP trees/sub-trees. The FSALPS meta–heuristic continuously refines the feature subset selection process whiles simultaneously evolving efficient classifiers through a non–converging evolutionary process that favors selection of features with high discrimination of class labels. We investigated and compared the performance of canonical GP, ALPS and FSALPS on high–dimensional benchmark classification datasets, including a hyperspectral image. Using Tukey’s HSD ANOVA test at a 95% confidence interval, ALPS and FSALPS dominated canonical GP in evolving smaller but efficient trees with less bloat expressions. FSALPS significantly outperformed canonical GP and ALPS and some reported feature selection strategies in related literature on dimensionality reduction.
Resumo:
This paper presents the formulation of a combinatorial optimization problem with the following characteristics: (i) the search space is the power set of a finite set structured as a Boolean lattice; (ii) the cost function forms a U-shaped curve when applied to any lattice chain. This formulation applies for feature selection in the context of pattern recognition. The known approaches for this problem are branch-and-bound algorithms and heuristics that explore partially the search space. Branch-and-bound algorithms are equivalent to the full search, while heuristics are not. This paper presents a branch-and-bound algorithm that differs from the others known by exploring the lattice structure and the U-shaped chain curves of the search space. The main contribution of this paper is the architecture of this algorithm that is based on the representation and exploration of the search space by new lattice properties proven here. Several experiments, with well known public data, indicate the superiority of the proposed method to the sequential floating forward selection (SFFS), which is a popular heuristic that gives good results in very short computational time. In all experiments, the proposed method got better or equal results in similar or even smaller computational time. (C) 2009 Elsevier Ltd. All rights reserved.
Resumo:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Resumo:
Piotr Omenzetter and Simon Hoell’s work within the Lloyd’s Register Foundation Centre for Safety and Reliability Engineering at the University of Aberdeen is supported by Lloyd’s Register Foundation. The Foundation helps to protect life and property by supporting engineering-related education, public engagement and the application of research.
Resumo:
En este artículo se investigan técnicas automáticas para encontrar un modelo óptimo de características en el caso de un analizador de dependencias basado en transiciones. Mostramos un estudio comparativo entre algoritmos de búsqueda, sistemas de validación y reglas de decisión demostrando al mismo tiempo que usando nuestros métodos es posible conseguir modelos complejos que proporcionan mejores resultados que los modelos que siguen configuraciones por defecto.
Resumo:
Non-technical losses (NTL) identification and prediction are important tasks for many utilities. Data from customer information system (CIS) can be used for NTL analysis. However, in order to accurately and efficiently perform NTL analysis, the original data from CIS need to be pre-processed before any detailed NTL analysis can be carried out. In this paper, we propose a feature selection based method for CIS data pre-processing in order to extract the most relevant information for further analysis such as clustering and classifications. By removing irrelevant and redundant features, feature selection is an essential step in data mining process in finding optimal subset of features to improve the quality of result by giving faster time processing, higher accuracy and simpler results with fewer features. Detailed feature selection analysis is presented in the paper. Both time-domain and load shape data are compared based on the accuracy, consistency and statistical dependencies between features.
Resumo:
Piotr Omenzetter and Simon Hoell's work within the Lloyd's Register Foundation Centre for Safety and Reliability Engineering at the University of Aberdeen is supported by Lloyd’s Register Foundation. The Foundation helps to protect life and property by supporting engineering-related education, public engagement and the application of research.
Resumo:
Piotr Omenzetter and Simon Hoell's work within the Lloyd's Register Foundation Centre for Safety and Reliability Engineering at the University of Aberdeen is supported by Lloyd’s Register Foundation. The Foundation helps to protect life and property by supporting engineering-related education, public engagement and the application of research.
Resumo:
Piotr Omenzetter and Simon Hoell’s work within the Lloyd’s Register Foundation Centre for Safety and Reliability Engineering at the University of Aberdeen is supported by Lloyd’s Register Foundation. The Foundation helps to protect life and property by supporting engineering-related education, public engagement and the application of research.
Resumo:
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator ({\em message}) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or Bayesian variable selection method, calculates the `median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named {\em DECO} for distributed variable selection and parameter estimation. In {\em DECO}, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does NOT depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
For datasets with both large sample sizes and high dimensionality, I propose a new "divided-and-conquer" framework {\em DEME} (DECO-message) by leveraging both the {\em DECO} and the {\em message} algorithm. The new framework first partitions the dataset in the sample space into row cubes using {\em message} and then partition the feature space of the cubes using {\em DECO}. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted in a computer in parallel. The results are then synthezied via the {\em DECO} and {\em message} algorithm in a reverse order to produce the final output. The whole framework is extremely scalable.
Resumo:
Background: Feature selection is a pattern recognition approach to choose important variables according to some criteria in order to distinguish or explain certain phenomena (i.e., for dimensionality reduction). There are many genomic and proteomic applications that rely on feature selection to answer questions such as selecting signature genes which are informative about some biological state, e. g., normal tissues and several types of cancer; or inferring a prediction network among elements such as genes, proteins and external stimuli. In these applications, a recurrent problem is the lack of samples to perform an adequate estimate of the joint probabilities between element states. A myriad of feature selection algorithms and criterion functions have been proposed, although it is difficult to point the best solution for each application. Results: The intent of this work is to provide an open-source multiplataform graphical environment for bioinformatics problems, which supports many feature selection algorithms, criterion functions and graphic visualization tools such as scatterplots, parallel coordinates and graphs. A feature selection approach for growing genetic networks from seed genes ( targets or predictors) is also implemented in the system. Conclusion: The proposed feature selection environment allows data analysis using several algorithms, criterion functions and graphic visualization tools. Our experiments have shown the software effectiveness in two distinct types of biological problems. Besides, the environment can be used in different pattern recognition applications, although the main concern regards bioinformatics tasks.