874 results for correlation-based feature selection
Abstract:
This paper presents a kernel density correlation based non-rigid point set matching method and shows its application in statistical model based 2D/3D reconstruction of a scaled, patient-specific model from an uncalibrated X-ray radiograph. In this method, both the reference point set and the floating point set are first represented using kernel density estimates. A correlation measure between these two kernel density estimates is then optimized to find a displacement field that moves the floating point set onto the reference point set. Regularizations based on the overall deformation energy and the motion smoothness energy are used to constrain the displacement field for robust point set matching. Incorporating this non-rigid point set matching method into a statistical model based 2D/3D reconstruction framework, we can reconstruct a scaled, patient-specific model from noisy edge points that are extracted directly from the X-ray radiograph by an edge detector. Our experiment, conducted on datasets of two patients and six cadavers, demonstrates a mean reconstruction error of 1.9 mm.
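As an illustration of the matching step, the sketch below aligns two 2D point sets by maximizing a closed-form Gaussian kernel correlation, with a simple quadratic displacement penalty standing in for the deformation and smoothness energies described above; the parameter values and data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: kernel-density-correlation point set matching (assumed
# isotropic Gaussian kernels; a plain quadratic penalty stands in for the
# paper's deformation/smoothness regularization).
import numpy as np
from scipy.optimize import minimize

def kernel_correlation(P, Q, sigma):
    """Closed-form correlation of two Gaussian KDEs (up to a constant factor)."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (4.0 * sigma ** 2)).sum()

def match(reference, floating, sigma=1.0, lam=0.1):
    """Estimate per-point displacements that move `floating` toward `reference`."""
    n, d = floating.shape

    def objective(u):
        U = u.reshape(n, d)
        # maximize correlation (minimize its negative) plus a displacement penalty
        return -kernel_correlation(reference, floating + U, sigma) + lam * (U ** 2).sum()

    res = minimize(objective, np.zeros(n * d), method="L-BFGS-B")
    return floating + res.x.reshape(n, d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(50, 2))
    flo = ref + 0.3                      # floating set = shifted copy of the reference
    moved = match(ref, flo)
    # residual should shrink relative to the initial 0.3 offset
    print(np.abs(moved - ref).mean())
```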
Abstract:
Automatic blood glucose classification may help specialists provide a better interpretation of blood glucose data downloaded directly from patients' glucose meters, and will contribute to the development of decision support systems for gestational diabetes. This paper presents an automatic blood glucose classifier for gestational diabetes that compares six different feature selection methods for two machine learning algorithms: neural networks and decision trees. Three search algorithms (Greedy, Best First and Genetic) were combined with two different evaluators (CFS and Wrapper) for the feature selection. The study was conducted with 6080 blood glucose measurements from 25 patients. Decision trees with a feature set selected with the Wrapper evaluator and the Best First search algorithm obtained the best accuracy: 95.92%.
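The snippet below is a minimal sketch of wrapper-style feature selection around a decision tree, using a greedy forward search as a simplification of the Best First and Genetic searches compared above; the data are synthetic stand-ins, not the study's glucose measurements.

```python
# Minimal sketch: wrapper feature selection for a decision tree with a greedy
# forward search; all data are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))                      # stand-in glucose-derived features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)       # synthetic class labels

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
selector = SequentialFeatureSelector(tree, n_features_to_select=4,
                                     direction="forward", cv=5)
selector.fit(X, y)

X_sel = selector.transform(X)
print("selected features:", np.flatnonzero(selector.get_support()))
print("cross-validated accuracy: %.3f" % cross_val_score(tree, X_sel, y, cv=5).mean())
```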
Abstract:
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge lies in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator ({\em message}) algorithm to address these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or a Bayesian variable selection method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
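A minimal sketch of the sample-space "message" idea follows: sparse selection on each subset, a median inclusion indicator to form the consensus support, per-subset refitting, and averaging of the estimates. The use of LassoCV and plain least squares here is an assumption for illustration, not the thesis implementation.

```python
# Minimal sketch of the "message" idea: 1) sparse selection per subset,
# 2) median inclusion index, 3) per-subset refit on the consensus support,
# 4) average of the coefficient estimates.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def message(X, y, n_subsets=4, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(n), n_subsets)

    inclusion = np.zeros((n_subsets, p))
    for k, rows in enumerate(subsets):
        lasso = LassoCV(cv=5).fit(X[rows], y[rows])
        inclusion[k] = np.abs(lasso.coef_) > 1e-8       # selected on this subset?

    support = np.flatnonzero(np.median(inclusion, axis=0) >= 0.5)

    coefs = np.zeros((n_subsets, support.size))
    for k, rows in enumerate(subsets):
        coefs[k] = LinearRegression().fit(X[rows][:, support], y[rows]).coef_
    return support, coefs.mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 50))
    beta = np.zeros(50); beta[:3] = [2.0, -1.5, 1.0]
    y = X @ beta + rng.normal(scale=0.5, size=2000)
    print(message(X, y))
```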
While sample space partitioning is useful in handling datasets with large sample sizes, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named {\em DECO} for distributed variable selection and parameter estimation. In {\em DECO}, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
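The sketch below illustrates the decorrelation idea on synthetic data: pre-multiplying (X, y) by an inverse square root of XX^T/p (with a small ridge term, an assumption here) so that feature blocks can be fitted independently with a sparse method. It is a rough illustration, not the DECO procedure as specified in the thesis.

```python
# Minimal sketch of the decorrelation idea: the ridge term `r` and the
# per-block LassoCV fit are assumptions for illustration.
import numpy as np
from scipy.linalg import sqrtm
from sklearn.linear_model import LassoCV

def deco_fit(X, y, n_blocks=5, r=1.0):
    n, p = X.shape
    F = np.real(np.linalg.inv(sqrtm(X @ X.T / p + r * np.eye(n))))
    Xd, yd = F @ X, F @ y                               # decorrelated data

    coef = np.zeros(p)
    for cols in np.array_split(np.arange(p), n_blocks):
        coef[cols] = LassoCV(cv=5).fit(Xd[:, cols], yd).coef_   # fit each feature block
    return coef

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 500))
    beta = np.zeros(500); beta[[10, 200, 400]] = [3.0, -2.0, 1.5]
    y = X @ beta + rng.normal(scale=0.5, size=300)
    print("large coefficients at:", np.flatnonzero(np.abs(deco_fit(X, y)) > 0.5))
```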
For datasets with both large sample sizes and high dimensionality, I propose a new "divide-and-conquer" framework {\em DEME} (DECO-message) that leverages both the {\em DECO} and the {\em message} algorithms. The new framework first partitions the dataset in the sample space into row cubes using {\em message} and then partitions the feature space of the cubes using {\em DECO}. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each of a feasible size that can be stored and fitted on a single computer in parallel. The results are then synthesized via the {\em DECO} and {\em message} algorithms in reverse order to produce the final output. The whole framework is extremely scalable.
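For illustration only, the following sketch shows the block tiling implied by this procedure: the data matrix is split into row cubes and each cube into feature blocks. The per-block fits and the reverse-order synthesis via DECO and message are omitted.

```python
# Minimal sketch of the row-cube / feature-block tiling only.
import numpy as np

def tile(X, n_row_cubes=4, n_col_blocks=5):
    row_ids = np.array_split(np.arange(X.shape[0]), n_row_cubes)
    col_ids = np.array_split(np.arange(X.shape[1]), n_col_blocks)
    # blocks[i][j] is the (i, j)-th small block of the original data matrix
    return [[X[np.ix_(r, c)] for c in col_ids] for r in row_ids]

X = np.arange(40 * 100).reshape(40, 100)
print([[b.shape for b in row] for row in tile(X)])
```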
Abstract:
To maintain the pace of development set by Moore's law, production processes in semiconductor manufacturing are becoming more and more complex. The development of efficient and interpretable anomaly detection systems is fundamental to keeping production costs low. As the dimension of process monitoring data can become extremely high, anomaly detection systems are impacted by the curse of dimensionality, and hence dimensionality reduction plays an important role. Classical dimensionality reduction approaches, such as Principal Component Analysis, generally involve transformations that seek to maximize the explained variance. In datasets with several clusters of correlated variables, the contributions of isolated variables to the explained variance may be insignificant, with the result that they may not be included in the reduced data representation. It is then not possible to detect an anomaly if it is only reflected in such isolated variables. In this paper we present a new dimensionality reduction technique that takes account of such isolated variables and demonstrate how it can be used to build an interpretable and robust anomaly detection system for Optical Emission Spectroscopy data.
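For context, the sketch below shows the classical PCA reconstruction-error baseline discussed above on synthetic data; the proposed technique that explicitly retains isolated variables is not reproduced here, and all data and dimensions are invented.

```python
# Minimal sketch: PCA reconstruction-error anomaly scoring on synthetic data
# (the classical baseline only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 30))            # stand-in for OES channels
test = rng.normal(size=(50, 30))
test[0, 17] += 8.0                            # spike in a single channel

pca = PCA(n_components=5).fit(train)
recon = pca.inverse_transform(pca.transform(test))
score = ((test - recon) ** 2).sum(axis=1)     # reconstruction (residual) error

print("median score: %.1f   perturbed-row score: %.1f" % (np.median(score), score[0]))
```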
Abstract:
This paper presents a case-based heuristic selection approach for automated university course and exam timetabling. The method described in this paper is motivated by the goal of developing timetabling systems that are fundamentally more general than the current state of the art. Heuristics that worked well in previous similar situations are memorized in a case base and are retrieved for solving the problem at hand. Knowledge discovery techniques are employed in two distinct scenarios. Firstly, we model the problem and the problem-solving situations along with specific heuristics for those problems. Secondly, we refine the case base and discard cases which prove not to be useful in solving new problems. Experimental results are presented and analyzed. It is shown that case-based reasoning can act effectively as an intelligent approach to learn which heuristics work well for particular timetabling situations. We conclude by outlining and discussing potential research issues in this critical area of knowledge discovery for different difficult timetabling problems.
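As a toy illustration of the retrieval step, the sketch below stores (problem description, heuristic) pairs and returns the heuristic of the nearest stored case; the problem features, heuristic names and distance measure are illustrative assumptions, not the system described in the paper.

```python
# Toy sketch of case retrieval: reuse the heuristic of the most similar stored
# case (feature scaling omitted for brevity).
import numpy as np

case_base = [
    # ([n_events, n_rooms, constraint_density], heuristic that worked well)
    (np.array([120.0, 10.0, 0.30]), "largest-degree-first"),
    (np.array([400.0, 25.0, 0.10]), "saturation-degree"),
    (np.array([ 60.0,  5.0, 0.55]), "largest-enrolment-first"),
]

def retrieve(problem, cases=case_base):
    """Return the heuristic of the most similar stored case (Euclidean distance)."""
    distances = [np.linalg.norm(problem - features) for features, _ in cases]
    return cases[int(np.argmin(distances))][1]

print(retrieve(np.array([110.0, 12.0, 0.28])))   # -> "largest-degree-first"
```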
Abstract:
Hazardous materials are substances that, if not regulated, can pose a threat to human populations and their environmental health, safety or property when transported in commerce. About 1.5 million tons of hazardous material shipments are transported by truck in the US annually, with a steady increase of approximately 5% per year. The objective of this study was to develop a routing tool for hazardous material transport that facilitates reduced environmental impacts and fewer transportation difficulties, while still finding paths that remain compelling for shipping carriers in terms of trucking cost. The study started with the identification of inhalation hazard impact zones and explosion protective areas around the locations of hypothetical hazardous material releases, considering different parameters (i.e., chemical characteristics, release quantities, atmospheric conditions, etc.). Results showed that the consequences of these incidents can differ depending on the release quantity, the chemical, and the atmospheric stability (a function of wind speed, meteorology, sky cover, and the time and location of the accident). The study was then extended with additional evaluation criteria, because health risk would not be the only concern in the selection of routes. Transportation difficulties (i.e., road blockage and congestion) were incorporated as an important factor due to their indirect impact and cost on the users of transportation networks. Trucking costs were also considered as one of the primary criteria in the selection of hazardous material paths; otherwise the suggested routes would not have been convincing for the shipping companies. The final criterion was the proximity of public places to the routes. The approach evolved from a simple framework into an efficient GIS-based tool able to investigate the transportation network of any given study area and to generate the best routing options for cargoes. The suggested tool uses a multi-criteria decision-making method that considers the priorities of the decision makers in choosing the cargo routes. Comparison of the routing options based on each criterion, and of the overall suitability of each path with regard to all criteria combined through the multi-criteria decision-making method, showed that tools such as the one proposed by this study can provide decision makers with insights in the area of hazardous material transport. The tool presents the probable consequences of each candidate path in an easily understandable way, in the form of maps and tables, which makes the tradeoffs between costs and risks considerably simpler to assess; in some cases, slightly compromising on trucking cost may drastically decrease the probable health risk and/or traffic difficulties. This will not only reward the community by making cities safer places to live, but can also benefit shipping companies by allowing them to advertise as environmentally friendly carriers.
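As an illustration of the multi-criteria comparison, the sketch below scores candidate routes with a weighted sum over normalized criteria; the routes, criterion values and weights are invented for the example, and every criterion is treated as lower-is-better.

```python
# Minimal sketch of weighted-sum multi-criteria route scoring; all values are
# illustrative, not output of the GIS tool.
import numpy as np

criteria = ["health_risk", "traffic_difficulty", "trucking_cost", "public_proximity"]
weights = np.array([0.4, 0.2, 0.3, 0.1])           # decision-maker priorities

routes = {
    "route_A": np.array([0.8, 0.3, 0.5, 0.6]),
    "route_B": np.array([0.4, 0.6, 0.7, 0.3]),
    "route_C": np.array([0.5, 0.4, 0.4, 0.5]),
}

M = np.array(list(routes.values()))
norm = (M - M.min(axis=0)) / (M.max(axis=0) - M.min(axis=0))   # scale each criterion to [0, 1]
scores = norm @ weights                                        # lower total is better
print(dict(zip(routes, scores.round(3))), "-> best:", list(routes)[int(np.argmin(scores))])
```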
Abstract:
This paper proposes a process for the classification of new residential electricity customers. The current state of the art is extended by using a combination of smart metering and survey data and by using model-based feature selection for the classification task. Firstly, the normalized representative consumption profiles of the population are derived through the clustering of data from households. Secondly, new customers are classified using survey data and a limited amount of smart metering data. Thirdly, regression analysis and model-based feature selection results explain the importance of the variables and the drivers of different consumption profiles, enabling the extraction of appropriate models. The results of a case study show that the use of survey data significantly increases the accuracy of the classification task (by up to 20%). Considering four consumption groups, more than half of the customers are correctly classified with only one week of metering data, and with more weeks the accuracy is significantly improved. The use of model-based feature selection resulted in a significantly lower number of features, allowing an easy interpretation of the derived models.
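The sketch below illustrates the two stages on synthetic data: clustering normalized load profiles into four consumption groups, then predicting the group from survey answers plus a small slice of metering data, with an L1-penalized logistic regression standing in for model-based feature selection; the data and model choices are assumptions, not the case study.

```python
# Minimal sketch: cluster normalized profiles, then classify new customers from
# survey features plus limited metering data (synthetic data throughout).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
profiles = rng.random((300, 48))                        # half-hourly average profiles
profiles /= profiles.sum(axis=1, keepdims=True)         # normalize each profile

groups = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(profiles)

survey = rng.normal(size=(300, 10))                     # stand-in survey answers
one_week = profiles[:, :8] + rng.normal(scale=0.002, size=(300, 8))  # limited metering data
X = np.hstack([survey, one_week])

clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)   # sparse coefficients
print("cross-validated accuracy: %.2f" % cross_val_score(clf, X, groups, cv=5).mean())
```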
Abstract:
One of the major aims of BCI research is to achieve faster and more efficient control of external devices. The identification of individual tap events in a motor imagery BCI is therefore a desirable goal. EEG is recorded from subjects performing and imagining finger taps with their left and right hands. A Differential Evolution based feature selection wrapper is used to identify optimal features in the spatial and frequency domains for tap identification. Channel-frequency band combinations are found which allow differentiation of tap vs. no-tap control conditions for executed and imagined taps. Left vs. right hand taps may also be differentiated with features found in this manner. A sliding time window is then used to accurately identify individual taps in the executed-tap and imagined-tap conditions. Highly statistically significant classification accuracies are achieved with time windows of 0.5 s and longer, allowing taps to be identified on a single-trial basis.
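A minimal sketch of a differential-evolution feature-selection wrapper is shown below: continuous weights are thresholded to a feature mask, and the fitness is the cross-validated accuracy of a simple classifier. The LDA classifier, the thresholding trick and the synthetic channel/band features are assumptions, not the study's pipeline.

```python
# Minimal sketch of a DE-based feature-selection wrapper on synthetic data.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))               # e.g. per-channel frequency-band powers
y = (X[:, 2] - X[:, 7] > 0).astype(int)      # synthetic tap vs. no-tap labels

def fitness(w):
    mask = w > 0.5                           # threshold continuous weights to a mask
    if not mask.any():
        return 1.0                           # penalize an empty feature set
    acc = cross_val_score(LinearDiscriminantAnalysis(), X[:, mask], y, cv=5).mean()
    return 1.0 - acc                         # DE minimizes, so use the error rate

result = differential_evolution(fitness, bounds=[(0.0, 1.0)] * X.shape[1],
                                maxiter=10, popsize=5, seed=0, polish=False)
print("selected features:", np.flatnonzero(result.x > 0.5))
```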
Abstract:
The aim of this paper is to evaluate the diagnostic contribution of various types of texture features to the discrimination of hepatic tissue in abdominal non-enhanced Computed Tomography (CT) images. Regions of Interest (ROIs) corresponding to the classes normal liver, cyst, hemangioma, and hepatocellular carcinoma were drawn by an experienced radiologist. For each ROI, five distinct sets of texture features were extracted using First Order Statistics (FOS), the Spatial Gray Level Dependence Matrix (SGLDM), the Gray Level Difference Method (GLDM), Laws' Texture Energy Measures (TEM), and Fractal Dimension Measurements (FDM). In order to evaluate the ability of the texture features to discriminate the various types of hepatic tissue, each set of texture features, or its reduced version after genetic-algorithm-based feature selection, was fed to a feed-forward Neural Network (NN) classifier. For each NN, the area under the Receiver Operating Characteristic (ROC) curve (Az) was calculated for all one-vs-all discriminations of hepatic tissue. Additionally, the total Az for the multi-class discrimination task was estimated. The results show that features derived from FOS perform better than the other texture features (total Az: 0.802 ± 0.083) in the discrimination of hepatic tissue.
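For reference, the sketch below computes a small set of first-order-statistics (FOS) texture features for a region of interest; the exact FOS feature set used in the paper may differ, and the ROI here is random stand-in data.

```python
# Minimal sketch of first-order-statistics (FOS) texture features for a ROI.
import numpy as np
from scipy.stats import skew, kurtosis

def fos_features(roi):
    """Mean, standard deviation, skewness, kurtosis and entropy of gray levels."""
    g = roi.ravel().astype(float)
    hist, _ = np.histogram(g, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum()
    return np.array([g.mean(), g.std(), skew(g), kurtosis(g), entropy])

roi = np.random.default_rng(0).integers(0, 256, size=(32, 32))   # stand-in ROI
print(fos_features(roi))
```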
Abstract:
Piotr Omenzetter and Simon Hoell’s work within the Lloyd’s Register Foundation Centre for Safety and Reliability Engineering at the University of Aberdeen is supported by Lloyd’s Register Foundation. The Foundation helps to protect life and property by supporting engineering-related education, public engagement and the application of research.