904 resultados para Feature discretization
Resumo:
Discrete data representations are necessary, or at least convenient, in many machine learning problems. While feature selection (FS) techniques aim at finding relevant subsets of features, the goal of feature discretization (FD) is to find concise (quantized) data representations, adequate for the learning task at hand. In this paper, we propose two incremental methods for FD. The first method belongs to the filter family, in which the quality of the discretization is assessed by a (supervised or unsupervised) relevance criterion. The second method is a wrapper, where discretized features are assessed using a classifier. Both methods can be coupled with any static (unsupervised or supervised) discretization procedure and can be used to perform FS as pre-processing or post-processing stages. The proposed methods attain efficient representations suitable for binary and multi-class problems with different types of data, being competitive with existing methods. Moreover, using well-known FS methods with the features discretized by our techniques leads to better accuracy than with the features discretized by other methods or with the original features. (C) 2013 Elsevier B.V. All rights reserved.
Resumo:
Many learning problems require handling high dimensional datasets with a relatively small number of instances. Learning algorithms are thus confronted with the curse of dimensionality, and need to address it in order to be effective. Examples of these types of data include the bag-of-words representation in text classification problems and gene expression data for tumor detection/classification. Usually, among the high number of features characterizing the instances, many may be irrelevant (or even detrimental) for the learning tasks. It is thus clear that there is a need for adequate techniques for feature representation, reduction, and selection, to improve both the classification accuracy and the memory requirements. In this paper, we propose combined unsupervised feature discretization and feature selection techniques, suitable for medium and high-dimensional datasets. The experimental results on several standard datasets, with both sparse and dense features, show the efficiency of the proposed techniques as well as improvements over previous related techniques.
Resumo:
Feature discretization (FD) techniques often yield adequate and compact representations of the data, suitable for machine learning and pattern recognition problems. These representations usually decrease the training time, yielding higher classification accuracy while allowing for humans to better understand and visualize the data, as compared to the use of the original features. This paper proposes two new FD techniques. The first one is based on the well-known Linde-Buzo-Gray quantization algorithm, coupled with a relevance criterion, being able perform unsupervised, supervised, or semi-supervised discretization. The second technique works in supervised mode, being based on the maximization of the mutual information between each discrete feature and the class label. Our experimental results on standard benchmark datasets show that these techniques scale up to high-dimensional data, attaining in many cases better accuracy than existing unsupervised and supervised FD approaches, while using fewer discretization intervals.
Resumo:
In machine learning and pattern recognition tasks, the use of feature discretization techniques may have several advantages. The discretized features may hold enough information for the learning task at hand, while ignoring minor fluctuations that are irrelevant or harmful for that task. The discretized features have more compact representations that may yield both better accuracy and lower training time, as compared to the use of the original features. However, in many cases, mainly with medium and high-dimensional data, the large number of features usually implies that there is some redundancy among them. Thus, we may further apply feature selection (FS) techniques on the discrete data, keeping the most relevant features, while discarding the irrelevant and redundant ones. In this paper, we propose relevance and redundancy criteria for supervised feature selection techniques on discrete data. These criteria are applied to the bin-class histograms of the discrete features. The experimental results, on public benchmark data, show that the proposed criteria can achieve better accuracy than widely used relevance and redundancy criteria, such as mutual information and the Fisher ratio.
Resumo:
Las organizaciones y sus entornos son sistemas complejos. Tales sistemas son difíciles de comprender y predecir. Pese a ello, la predicción es una tarea fundamental para la gestión empresarial y para la toma de decisiones que implica siempre un riesgo. Los métodos clásicos de predicción (entre los cuales están: la regresión lineal, la Autoregresive Moving Average y el exponential smoothing) establecen supuestos como la linealidad, la estabilidad para ser matemática y computacionalmente tratables. Por diferentes medios, sin embargo, se han demostrado las limitaciones de tales métodos. Pues bien, en las últimas décadas nuevos métodos de predicción han surgido con el fin de abarcar la complejidad de los sistemas organizacionales y sus entornos, antes que evitarla. Entre ellos, los más promisorios son los métodos de predicción bio-inspirados (ej. redes neuronales, algoritmos genéticos /evolutivos y sistemas inmunes artificiales). Este artículo pretende establecer un estado situacional de las aplicaciones actuales y potenciales de los métodos bio-inspirados de predicción en la administración.
Resumo:
In this thesis a mathematical model was derived that describes the charge and energy transport in semiconductor devices like transistors. Moreover, numerical simulations of these physical processes are performed. In order to accomplish this, methods of theoretical physics, functional analysis, numerical mathematics and computer programming are applied. After an introduction to the status quo of semiconductor device simulation methods and a brief review of historical facts up to now, the attention is shifted to the construction of a model, which serves as the basis of the subsequent derivations in the thesis. Thereby the starting point is an important equation of the theory of dilute gases. From this equation the model equations are derived and specified by means of a series expansion method. This is done in a multi-stage derivation process, which is mainly taken from a scientific paper and which does not constitute the focus of this thesis. In the following phase we specify the mathematical setting and make precise the model assumptions. Thereby we make use of methods of functional analysis. Since the equations we deal with are coupled, we are concerned with a nonstandard problem. In contrary, the theory of scalar elliptic equations is established meanwhile. Subsequently, we are preoccupied with the numerical discretization of the equations. A special finite-element method is used for the discretization. This special approach has to be done in order to make the numerical results appropriate for practical application. By a series of transformations from the discrete model we derive a system of algebraic equations that are eligible for numerical evaluation. Using self-made computer programs we solve the equations to get approximate solutions. These programs are based on new and specialized iteration procedures that are developed and thoroughly tested within the frame of this research work. Due to their importance and their novel status, they are explained and demonstrated in detail. We compare these new iterations with a standard method that is complemented by a feature to fit in the current context. A further innovation is the computation of solutions in three-dimensional domains, which are still rare. Special attention is paid to applicability of the 3D simulation tools. The programs are designed to have justifiable working complexity. The simulation results of some models of contemporary semiconductor devices are shown and detailed comments on the results are given. Eventually, we make a prospect on future development and enhancements of the models and of the algorithms that we used.
Resumo:
Background: Feature selection is a pattern recognition approach to choose important variables according to some criteria in order to distinguish or explain certain phenomena (i.e., for dimensionality reduction). There are many genomic and proteomic applications that rely on feature selection to answer questions such as selecting signature genes which are informative about some biological state, e. g., normal tissues and several types of cancer; or inferring a prediction network among elements such as genes, proteins and external stimuli. In these applications, a recurrent problem is the lack of samples to perform an adequate estimate of the joint probabilities between element states. A myriad of feature selection algorithms and criterion functions have been proposed, although it is difficult to point the best solution for each application. Results: The intent of this work is to provide an open-source multiplataform graphical environment for bioinformatics problems, which supports many feature selection algorithms, criterion functions and graphic visualization tools such as scatterplots, parallel coordinates and graphs. A feature selection approach for growing genetic networks from seed genes ( targets or predictors) is also implemented in the system. Conclusion: The proposed feature selection environment allows data analysis using several algorithms, criterion functions and graphic visualization tools. Our experiments have shown the software effectiveness in two distinct types of biological problems. Besides, the environment can be used in different pattern recognition applications, although the main concern regards bioinformatics tasks.
Resumo:
Morbid obesity is highly prevalent in the Western World, and its consequences present a real public health challenge. Voice alterations can represent one of these consequences and represent an opportunity for interference with therapeutic methods. This particularly features of the individual`s voice was the goal of the present study. A group of 45 adult volunteers of both sexes with a BMI greater than 35 Kg/m(2) was selected among patients of the Obesity Ambulatory of the Digestive Surgery Division. The control group consisted of volunteers matched by sex, age (+/- 1 year), and smoking habits, but with a BMI bellow 30 Kg/m(2). All subjects were submitted to laryngoscopic examination, audio perceptive analysis, and voice acoustics determination. Examinations were always performed by the same doctor, and diagnoses were provided by two different physician specialists in laryngology and voice. Obese individuals exhibit the following modifications in voice feature: hoarseness, murmuring, vocal instability, altered jitter and shimmer, and reduced maximum phonation times as well the presence of voice strangulation at the end of emission. The voices of individuals with morbid obesity are different of the voice of nonobese people and demonstrate significant changes in vocal characteristics.
Resumo:
Aims: Claudins, a large family of essential tight junction (TJ) proteins, are abnormally regulated in human carcinomas, especially claudin-7. The aim of this study was to investigate claudin-7 expression and alterations in oral squamous cell carcinoma (OSCC). Methods and results: Expression of claudin-7 was analysed in 132 cases of OSCC organized in a tissue microarray. Claudin-7 mRNA transcript was evaluated using real-time polymerase chain reaction and the methylation status of the promoter was also assessed. Claudin-7 was negative in 58.3% of the cases. Loss of claudin-7 protein expression was associated with recurrence (P = 0.019), tumour size (P = 0.014), clinical stage of OSCC (P = 0.055) and disease-free survival (P = 0.015). Down-regulation of the claudin-7 mRNA transcripts was observed in 78% of the cases, in accordance with immunoexpression. Analysis of the methylation status of the promoter region of claudin-7 revealed that treatment of O28 cells (that did not express claudin-7 mRNA transcripts) with 5-Aza-2`-Deoxycytidine (5-Aza-dC) led to the re-expression of claudin-7 mRNA transcript. Conclusion: Loss of claudin-7 expression is associated with important subcellular processes in OSCC with impact on clinical parameters.
Resumo:
In this work, we take advantage of association rule mining to support two types of medical systems: the Content-based Image Retrieval (CBIR) systems and the Computer-Aided Diagnosis (CAD) systems. For content-based retrieval, association rules are employed to reduce the dimensionality of the feature vectors that represent the images and to improve the precision of the similarity queries. We refer to the association rule-based method to improve CBIR systems proposed here as Feature selection through Association Rules (FAR). To improve CAD systems, we propose the Image Diagnosis Enhancement through Association rules (IDEA) method. Association rules are employed to suggest a second opinion to the radiologist or a preliminary diagnosis of a new image. A second opinion automatically obtained can either accelerate the process of diagnosing or to strengthen a hypothesis, increasing the probability of a prescribed treatment be successful. Two new algorithms are proposed to support the IDEA method: to pre-process low-level features and to propose a preliminary diagnosis based on association rules. We performed several experiments to validate the proposed methods. The results indicate that association rules can be successfully applied to improve CBIR and CAD systems, empowering the arsenal of techniques to support medical image analysis in medical systems. (C) 2009 Elsevier B.V. All rights reserved.
Resumo:
In this paper, we propose a method based on association rule-mining to enhance the diagnosis of medical images (mammograms). It combines low-level features automatically extracted from images and high-level knowledge from specialists to search for patterns. Our method analyzes medical images and automatically generates suggestions of diagnoses employing mining of association rules. The suggestions of diagnosis are used to accelerate the image analysis performed by specialists as well as to provide them an alternative to work on. The proposed method uses two new algorithms, PreSAGe and HiCARe. The PreSAGe algorithm combines, in a single step, feature selection and discretization, and reduces the mining complexity. Experiments performed on PreSAGe show that this algorithm is highly suitable to perform feature selection and discretization in medical images. HiCARe is a new associative classifier. The HiCARe algorithm has an important property that makes it unique: it assigns multiple keywords per image to suggest a diagnosis with high values of accuracy. Our method was applied to real datasets, and the results show high sensitivity (up to 95%) and accuracy (up to 92%), allowing us to claim that the use of association rules is a powerful means to assist in the diagnosing task.
Resumo:
Feature selection is one of important and frequently used techniques in data preprocessing. It can improve the efficiency and the effectiveness of data mining by reducing the dimensions of feature space and removing the irrelevant and redundant information. Feature selection can be viewed as a global optimization problem of finding a minimum set of M relevant features that describes the dataset as well as the original N attributes. In this paper, we apply the adaptive partitioned random search strategy into our feature selection algorithm. Under this search strategy, the partition structure and evaluation function is proposed for feature selection problem. This algorithm ensures the global optimal solution in theory and avoids complete randomness in search direction. The good property of our algorithm is shown through the theoretical analysis.