31 resultados para Feature selection algorithm

em Universidad Politécnica de Madrid


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Automatic blood glucose classification may help specialists to provide a better interpretation of blood glucose data, downloaded directly from patients glucose meter and will contribute in the development of decision support systems for gestational diabetes. This paper presents an automatic blood glucose classifier for gestational diabetes that compares 6 different feature selection methods for two machine learning algorithms: neural networks and decision trees. Three searching algorithms, Greedy, Best First and Genetic, were combined with two different evaluators, CSF and Wrapper, for the feature selection. The study has been made with 6080 blood glucose measurements from 25 patients. Decision trees with a feature set selected with the Wrapper evaluator and the Best first search algorithm obtained the best accuracy: 95.92%.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This research proposes a generic methodology for dimensionality reduction upon time-frequency representations applied to the classification of different types of biosignals. The methodology directly deals with the highly redundant and irrelevant data contained in these representations, combining a first stage of irrelevant data removal by variable selection, with a second stage of redundancy reduction using methods based on linear transformations. The study addresses two techniques that provided a similar performance: the first one is based on the selection of a set of the most relevant time?frequency points, whereas the second one selects the most relevant frequency bands. The first methodology needs a lower quantity of components, leading to a lower feature space; but the second improves the capture of the time-varying dynamics of the signal, and therefore provides a more stable performance. In order to evaluate the generalization capabilities of the methodology proposed it has been applied to two types of biosignals with different kinds of non-stationary behaviors: electroencephalographic and phonocardiographic biosignals. Even when these two databases contain samples with different degrees of complexity and a wide variety of characterizing patterns, the results demonstrate a good accuracy for the detection of pathologies, over 98%.The results open the possibility to extrapolate the methodology to the study of other biosignals.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Complex networks have been extensively used in the last decade to characterize and analyze complex systems, and they have been recently proposed as a novel instrument for the analysis of spectra extracted from biological samples. Yet, the high number of measurements composing spectra, and the consequent high computational cost, make a direct network analysis unfeasible. We here present a comparative analysis of three customary feature selection algorithms, including the binning of spectral data and the use of information theory metrics. Such algorithms are compared by assessing the score obtained in a classification task, where healthy subjects and people suffering from different types of cancers should be discriminated. Results indicate that a feature selection strategy based on Mutual Information outperforms the more classical data binning, while allowing a reduction of the dimensionality of the data set in two orders of magnitude

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In the last decade, the research community has focused on new classification methods that rely on statistical characteristics of Internet traffic, instead of pre-viously popular port-number-based or payload-based methods, which are under even bigger constrictions. Some research works based on statistical characteristics generated large fea-ture sets of Internet traffic; however, nowadays it?s impossible to handle hun-dreds of features in big data scenarios, only leading to unacceptable processing time and misleading classification results due to redundant and correlative data. As a consequence, a feature selection procedure is essential in the process of Internet traffic characterization. In this paper a survey of feature selection methods is presented: feature selection frameworks are introduced, and differ-ent categories of methods are briefly explained and compared; several proposals on feature selection in Internet traffic characterization are shown; finally, future application of feature selection to a concrete project is proposed.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Most data stream classification techniques assume that the underlying feature space is static. However, in real-world applications the set of features and their relevance to the target concept may change over time. In addition, when the underlying concepts reappear, reusing previously learnt models can enhance the learning process in terms of accuracy and processing time at the expense of manageable memory consumption. In this paper, we propose mining recurring concepts in a dynamic feature space (MReC-DFS), a data stream classification system to address the challenges of learning recurring concepts in a dynamic feature space while simultaneously reducing the memory cost associated with storing past models. MReC-DFS is able to detect and adapt to concept changes using the performance of the learning process and contextual information. To handle recurring concepts, stored models are combined in a dynamically weighted ensemble. Incremental feature selection is performed to reduce the combined feature space. This contribution allows MReC-DFS to store only the features most relevant to the learnt concepts, which in turn increases the memory efficiency of the technique. In addition, an incremental feature selection method is proposed that dynamically determines the threshold between relevant and irrelevant features. Experimental results demonstrating the high accuracy of MReC-DFS compared with state-of-the-art techniques on a variety of real datasets are presented. The results also show the superior memory efficiency of MReC-DFS.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

On-line partial discharge (PD) measurements have become a common technique for assessing the insulation condition of installed high voltage (HV) insulated cables. When on-line tests are performed in noisy environments, or when more than one source of pulse-shaped signals are present in a cable system, it is difficult to perform accurate diagnoses. In these cases, an adequate selection of the non-conventional measuring technique and the implementation of effective signal processing tools are essential for a correct evaluation of the insulation degradation. Once a specific noise rejection filter is applied, many signals can be identified as potential PD pulses, therefore, a classification tool to discriminate the PD sources involved is required. This paper proposes an efficient method for the classification of PD signals and pulse-type noise interferences measured in power cables with HFCT sensors. By using a signal feature generation algorithm, representative parameters associated to the waveform of each pulse acquired are calculated so that they can be separated in different clusters. The efficiency of the clustering technique proposed is demonstrated through an example with three different PD sources and several pulse-shaped interferences measured simultaneously in a cable system with a high frequency current transformer (HFCT).

Relevância:

80.00% 80.00%

Publicador:

Resumo:

The main purpose of a gene interaction network is to map the relationships of the genes that are out of sight when a genomic study is tackled. DNA microarrays allow the measure of gene expression of thousands of genes at the same time. These data constitute the numeric seed for the induction of the gene networks. In this paper, we propose a new approach to build gene networks by means of Bayesian classifiers, variable selection and bootstrap resampling. The interactions induced by the Bayesian classifiers are based both on the expression levels and on the phenotype information of the supervised variable. Feature selection and bootstrap resampling add reliability and robustness to the overall process removing the false positive findings. The consensus among all the induced models produces a hierarchy of dependences and, thus, of variables. Biologists can define the depth level of the model hierarchy so the set of interactions and genes involved can vary from a sparse to a dense set. Experimental results show how these networks perform well on classification tasks. The biological validation matches previous biological findings and opens new hypothesis for future studies

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Many diseases have a genetic origin, and a great effort is being made to detect the genes that are responsible for their insurgence. One of the most promising techniques is the analysis of genetic information through the use of complex networks theory. Yet, a practical problem of this approach is its computational cost, which scales as the square of the number of features included in the initial dataset. In this paper, we propose the use of an iterative feature selection strategy to identify reduced subsets of relevant features, and show an application to the analysis of congenital Obstructive Nephropathy. Results demonstrate that, besides achieving a drastic reduction of the computational cost, the topologies of the obtained networks still hold all the relevant information, and are thus able to fully characterize the severity of the disease.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

El presente proyecto tiene el objetivo de facilitar la composición de canciones mediante la creación de las distintas pistas MIDI que la forman. Se implementan dos controladores. El primero, con objeto de transcribir la parte melódica, convierte la voz cantada o tarareada a eventos MIDI. Para ello, y tras el estudio de las distintas técnicas del cálculo del tono (pitch), se implementará una técnica con ciertas variaciones basada en la autocorrelación. También se profundiza en el segmentado de eventos, en particular, una técnica basada en el análisis de la derivada de la envolvente. El segundo, dedicado a la base rítmica de la canción, permite la creación de la percusión mediante el golpe rítmico de objetos que disponga el usuario, que serán asignados a los distintos elementos de percusión elegidos. Los resultados de la grabación de estos impactos serán señales de corta duración, no lineales y no armónicas, dificultando su discriminación. La herramienta elegida para la clasificación de los distintos patrones serán las redes neuronales artificiales (RNA). Se realizara un estudio de la metodología de diseño de redes neuronales especifico para este tipo de señales, evaluando la importancia de las variables de diseño como son el número de capas ocultas y neuronas en cada una de ellas, algoritmo de entrenamiento y funciones de activación. El estudio concluirá con la implementación de dos redes de diferente naturaleza. Una red de Elman, cuyas propiedades de memoria permiten la clasificación de patrones temporales, procesará las cualidades temporales analizando el ataque de su forma de onda. Una red de propagación hacia adelante feed-forward, que necesitará de robustas características espectrales y temporales para su clasificación. Se proponen 26 descriptores como los derivados de los momentos del espectro: centroide, curtosis y simetría, los coeficientes cepstrales de la escala de Mel (MFCCs), y algunos temporales como son la tasa de cruces por cero y el centroide de la envolvente temporal. Las capacidades de discriminación inter e intra clase de estas características serán evaluadas mediante un algoritmo de selección, habiéndose elegido RELIEF, un método basado en el algoritmo de los k vecinos mas próximos (KNN). Ambos controladores tendrán función de trabajar en tiempo real y offline, permitiendo tanto la composición de canciones, como su utilización como un instrumento más junto con mas músicos. ABSTRACT. The aim of this project is to make song composition easier by creating each MIDI track that builds it. Two controllers are implemented. In order to transcribe the melody, the first controler converts singing voice or humming into MIDI files. To do this a technique based on autocorrelation is implemented after having studied different pitch detection methods. Event segmentation has also been dealt with, to be more precise a technique based on the analysis of the signal's envelope and it's derivative have been used. The second one, can be used to make the song's rhythm . It allows the user, to create percussive patterns by hitting different objects of his environment. These recordings results in short duration, non-linear and non-harmonic signals. Which makes the classification process more complicated in the traditional way. The tools to used are the artificial neural networks (ANN). We will study the neural network design to deal with this kind of signals. The goal is to get a design methodology, paying attention to the variables involved, as the number of hidden layers and neurons in each, transfer functions and training algorithm. The study will end implementing two neural networks with different nature. Elman network, which has memory properties, is capable to recognize sequences of data and analyse the impact's waveform, precisely, the attack portion. A feed-forward network, needs strong spectral and temporal features extracted from the hit. Some descriptors are proposed as the derivates from the spectrum moment as centroid, kurtosis and skewness, the Mel-frequency cepstral coefficients, and some temporal features as the zero crossing rate (zcr) and the temporal envelope's centroid. Intra and inter class discrimination abilities of those descriptors will be weighted using the selection algorithm RELIEF, a Knn (K-nearest neighbor) based algorithm. Both MIDI controllers can be used to compose, or play with other musicians as it works on real-time and offline.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In this paper, we analyze the performance of several well-known pattern recognition and dimensionality reduction techniques when applied to mass-spectrometry data for odor biometric identification. Motivated by the successful results of previous works capturing the odor from other parts of the body, this work attempts to evaluate the feasibility of identifying people by the odor emanated from the hands. By formulating this task according to a machine learning scheme, the problem is identified with a small-sample-size supervised classification problem in which the input data is formed by mass spectrograms from the hand odor of 13 subjects captured in different sessions. The high dimensionality of the data makes it necessary to apply feature selection and extraction techniques together with a simple classifier in order to improve the generalization capabilities of the model. Our experimental results achieve recognition rates over 85% which reveals that there exists discriminatory information in the hand odor and points at body odor as a promising biometric identifier.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Nonlinear analysis tools for studying and characterizing the dynamics of physiological signals have gained popularity, mainly because tracking sudden alterations of the inherent complexity of biological processes might be an indicator of altered physiological states. Typically, in order to perform an analysis with such tools, the physiological variables that describe the biological process under study are used to reconstruct the underlying dynamics of the biological processes. For that goal, a procedure called time-delay or uniform embedding is usually employed. Nonetheless, there is evidence of its inability for dealing with non-stationary signals, as those recorded from many physiological processes. To handle with such a drawback, this paper evaluates the utility of non-conventional time series reconstruction procedures based on non uniform embedding, applying them to automatic pattern recognition tasks. The paper compares a state of the art non uniform approach with a novel scheme which fuses embedding and feature selection at once, searching for better reconstructions of the dynamics of the system. Moreover, results are also compared with two classic uniform embedding techniques. Thus, the goal is comparing uniform and non uniform reconstruction techniques, including the one proposed in this work, for pattern recognition in biomedical signal processing tasks. Once the state space is reconstructed, the scheme followed characterizes with three classic nonlinear dynamic features (Largest Lyapunov Exponent, Correlation Dimension and Recurrence Period Density Entropy), while classification is carried out by means of a simple k-nn classifier. In order to test its generalization capabilities, the approach was tested with three different physiological databases (Speech Pathologies, Epilepsy and Heart Murmurs). In terms of the accuracy obtained to automatically detect the presence of pathologies, and for the three types of biosignals analyzed, the non uniform techniques used in this work lightly outperformed the results obtained using the uniform methods, suggesting their usefulness to characterize non-stationary biomedical signals in pattern recognition applications. On the other hand, in view of the results obtained and its low computational load, the proposed technique suggests its applicability for the applications under study.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

This paper studies feature subset selection in classification using a multiobjective estimation of distribution algorithm. We consider six functions, namely area under ROC curve, sensitivity, specificity, precision, F1 measure and Brier score, for evaluation of feature subsets and as the objectives of the problem. One of the characteristics of these objective functions is the existence of noise in their values that should be appropriately handled during optimization. Our proposed algorithm consists of two major techniques which are specially designed for the feature subset selection problem. The first one is a solution ranking method based on interval values to handle the noise in the objectives of this problem. The second one is a model estimation method for learning a joint probabilistic model of objectives and variables which is used to generate new solutions and advance through the search space. To simplify model estimation, l1 regularized regression is used to select a subset of problem variables before model learning. The proposed algorithm is compared with a well-known ranking method for interval-valued objectives and a standard multiobjective genetic algorithm. Particularly, the effects of the two new techniques are experimentally investigated. The experimental results show that the proposed algorithm is able to obtain comparable or better performance on the tested datasets.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Mass spectrometry (MS) data provide a promising strategy for biomarker discovery. For this purpose, the detection of relevant peakbins in MS data is currently under intense research. Data from mass spectrometry are challenging to analyze because of their high dimensionality and the generally low number of samples available. To tackle this problem, the scientific community is becoming increasingly interested in applying feature subset selection techniques based on specialized machine learning algorithms. In this paper, we present a performance comparison of some metaheuristics: best first (BF), genetic algorithm (GA), scatter search (SS) and variable neighborhood search (VNS). Up to now, all the algorithms, except for GA, have been first applied to detect relevant peakbins in MS data. All these metaheuristic searches are embedded in two different filter and wrapper schemes coupled with Naive Bayes and SVM classifiers.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The Self-OrganizingMap (SOM) is a neural network model that performs an ordered projection of a high dimensional input space in a low-dimensional topological structure. The process in which such mapping is formed is defined by the SOM algorithm, which is a competitive, unsupervised and nonparametric method, since it does not make any assumption about the input data distribution. The feature maps provided by this algorithm have been successfully applied for vector quantization, clustering and high dimensional data visualization processes. However, the initialization of the network topology and the selection of the SOM training parameters are two difficult tasks caused by the unknown distribution of the input signals. A misconfiguration of these parameters can generate a feature map of low-quality, so it is necessary to have some measure of the degree of adaptation of the SOM network to the input data model. The topologypreservation is the most common concept used to implement this measure. Several qualitative and quantitative methods have been proposed for measuring the degree of SOM topologypreservation, particularly using Kohonen's model. In this work, two methods for measuring the topologypreservation of the Growing Cell Structures (GCSs) model are proposed: the topographic function and the topology preserving map

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper proposes a method for the identification of different partial discharges (PDs) sources through the analysis of a collection of PD signals acquired with a PD measurement system. This method, robust and sensitive enough to cope with noisy data and external interferences, combines the characterization of each signal from the collection, with a clustering procedure, the CLARA algorithm. Several features are proposed for the characterization of the signals, being the wavelet variances, the frequency estimated with the Prony method, and the energy, the most relevant for the performance of the clustering procedure. The result of the unsupervised classification is a set of clusters each containing those signals which are more similar to each other than to those in other clusters. The analysis of the classification results permits both the identification of different PD sources and the discrimination between original PD signals, reflections, noise and external interferences. The methods and graphical tools detailed in this paper have been coded and published as a contributed package of the R environment under a GNU/GPL license.