927 results for Customer feature selection


Relevance: 80.00%

Abstract:

The present work belongs to the PRANA project, the first extensive field campaign observing atmospheric emission spectra covering the Far InfraRed (FIR) spectral region for more than two years. The principal deployed instrument is REFIR-PAD, a Fourier transform spectrometer that we use to study Antarctic cloud properties. A dataset covering the whole of 2013 has been analyzed. First, good-quality spectra are selected, using radiance values in a few chosen spectral regions as thresholds. These spectra are then summarized by averaging radiances in selected intervals, converting them into brightness temperatures (BTs) and finally taking the differences between each pair of BTs. A supervised feature selection algorithm is implemented to select the features that are genuinely informative about the presence, phase and type of cloud. To this end, training and test sets are collected by means of Lidar quick-looks. The supervised classification of the overall monthly datasets is performed using a support vector machine (SVM). On the basis of this classification and with the help of Lidar observations, 29 non-precipitating ice cloud case studies are selected. A single spectrum, or at most an average over two or three spectra, is processed by means of the retrieval algorithm RT-RET, exploiting some main IR window channels, in order to extract cloud properties. Retrieved effective radii and optical depths are analyzed, to compare them with literature studies and to evaluate possible seasonal trends. Finally, retrieval output atmospheric profiles are used as inputs for simulations, assuming two different crystal habits, with the aim of examining our ability to reproduce radiances in the FIR. Substantial mis-estimations are found for FIR micro-windows: a high variability is observed in the spectral pattern of simulation deviations from measured spectra, and an effort has been made to link these deviations to cloud parameters.
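The feature construction described in this abstract (interval-averaged radiance, conversion to brightness temperature, pairwise differences) can be sketched as follows. This is a minimal illustration, not the thesis's code: the function names, the use of wavenumber in m^-1, and the example values are assumptions, and the conversion is simply the standard inversion of Planck's law.

```python
import math
from itertools import combinations

# CODATA physical constants (SI units).
H = 6.62607015e-34   # Planck constant, J s
C = 299792458.0      # speed of light, m / s
K = 1.380649e-23     # Boltzmann constant, J / K


def planck_radiance(wavenumber_m, temperature_k):
    """Blackbody spectral radiance per unit wavenumber.

    wavenumber_m is in m^-1; the result is in W m^-2 sr^-1 (m^-1)^-1.
    """
    exponent = H * C * wavenumber_m / (K * temperature_k)
    return 2.0 * H * C**2 * wavenumber_m**3 / math.expm1(exponent)


def brightness_temperature(wavenumber_m, radiance):
    """Invert Planck's law: temperature of a blackbody emitting `radiance`."""
    ratio = 2.0 * H * C**2 * wavenumber_m**3 / radiance
    return (H * C * wavenumber_m / K) / math.log1p(ratio)


def bt_differences(bts):
    """Differences BT_i - BT_j for every pair of intervals with i < j."""
    return {(i, j): bts[i] - bts[j]
            for i, j in combinations(range(len(bts)), 2)}


# Hypothetical example: three interval-averaged brightness temperatures (K).
features = bt_differences([230.0, 228.5, 231.0])
```

Each entry of `features` would then be one candidate input to the feature selection step; which intervals are averaged is not stated in the abstract and is left open here.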

Relevance: 80.00%

Abstract:

Communication is currently shifting from a centralized scenario, where media such as newspapers, radio and TV programs produce information and people are just consumers, to a completely different decentralized scenario, where everyone is potentially an information producer through social networks, blogs and forums that allow real-time worldwide information exchange. As a result of their widespread diffusion, these new instruments have started playing an important socio-economic role. They are the most used communication media and, as a consequence, they constitute the main source of information that enterprises, political parties and other organizations can rely on. Analyzing data stored in servers all over the world is feasible by means of Text Mining techniques like Sentiment Analysis, which aims to extract opinions from huge amounts of unstructured text. This can be used to determine, for instance, the degree of user satisfaction with products, services, politicians and so on. In this context, this dissertation presents new Document Sentiment Classification methods based on the mathematical theory of Markov Chains. All these approaches rely on a Markov Chain based model, which is language independent and whose key features are simplicity and generality, which make it interesting with respect to previous sophisticated techniques. Every discussed technique has been tested in both Single-Domain and Cross-Domain Sentiment Classification, comparing its performance with that of two previous works. The analysis shows that some of the examined algorithms produce results comparable with the best methods in the literature, for both single-domain and cross-domain tasks, in two-class (i.e. positive and negative) Document Sentiment Classification. 
However, there is still room for improvement: this work also points the way to enhancing performance, namely that a good novel feature selection process would be enough to outperform the state of the art. Furthermore, since some of the proposed approaches show promising results in two-class Single-Domain Sentiment Classification, future work will also validate these results in tasks with more than two classes.

Relevance: 80.00%

Abstract:

The aim of this paper is to evaluate the diagnostic contribution of various types of texture features in the discrimination of hepatic tissue in abdominal non-enhanced Computed Tomography (CT) images. Regions of Interest (ROIs) corresponding to the classes normal liver, cyst, hemangioma, and hepatocellular carcinoma were drawn by an experienced radiologist. For each ROI, five distinct sets of texture features are extracted using First Order Statistics (FOS), Spatial Gray Level Dependence Matrix (SGLDM), Gray Level Difference Method (GLDM), Laws' Texture Energy Measures (TEM), and Fractal Dimension Measurements (FDM). In order to evaluate the ability of the texture features to discriminate the various types of hepatic tissue, each set of texture features, or its reduced version after genetic algorithm based feature selection, was fed to a feed-forward Neural Network (NN) classifier. For each NN, the area under the Receiver Operating Characteristic (ROC) curve (Az) was calculated for all one-vs-all discriminations of hepatic tissue. Additionally, the total Az for the multi-class discrimination task was estimated. The results show that features derived from FOS perform better than other texture features (total Az: 0.802 ± 0.083) in the discrimination of hepatic tissue.
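As an illustration of the one-vs-all evaluation mentioned in this abstract, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The function names and the toy labels and scores are assumptions. How the paper aggregates the per-class values into a total Az is not stated in the abstract, so that step is not shown.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def one_vs_all_auc(labels, scores, target):
    """AUC for discriminating `target` from all other classes.

    `scores` holds the classifier output for the target class, one per sample.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == target]
    neg = [s for lab, s in zip(labels, scores) if lab != target]
    return auc(pos, neg)


# Hypothetical ROI labels and network outputs for the "cyst" class.
labels = ["cyst", "normal", "cyst", "hcc"]
cyst_scores = [0.9, 0.2, 0.6, 0.7]
az_cyst = one_vs_all_auc(labels, cyst_scores, "cyst")
```

In this toy case three of the four positive/negative pairs are ranked correctly, so `az_cyst` is 0.75.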

Relevance: 80.00%

Abstract:

In this paper, a computer-aided diagnostic (CAD) system for the classification of hepatic lesions from computed tomography (CT) images is presented. Regions of interest (ROIs) taken from nonenhanced CT images of normal liver, hepatic cysts, hemangiomas, and hepatocellular carcinomas have been used as input to the system. The proposed system consists of two modules: the feature extraction and the classification modules. The feature extraction module calculates the average gray level and 48 texture characteristics, which are derived from the spatial gray-level co-occurrence matrices, obtained from the ROIs. The classifier module consists of three sequentially placed feed-forward neural networks (NNs). The first NN classifies into normal or pathological liver regions. The pathological liver regions are characterized by the second NN as cyst or "other disease." The third NN classifies "other disease" into hemangioma or hepatocellular carcinoma. Three feature selection techniques have been applied to each individual NN: the sequential forward selection, the sequential floating forward selection, and a genetic algorithm for feature selection. The comparative study of the above dimensionality reduction methods shows that genetic algorithms result in lower dimension feature vectors and improved classification performance.
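Of the three selection techniques named in this abstract, sequential forward selection is the simplest to sketch. The version below is a generic greedy loop under stated assumptions: `score` is a hypothetical callback that returns the quality of a feature subset (in the paper's setting this would presumably be classifier performance), and the stopping rule, which halts when no candidate improves the score, is a choice made for this illustration.

```python
def sequential_forward_selection(n_features, score, max_features):
    """Greedily add the feature that most improves score(subset).

    Returns the selected feature indices and the score of that subset.
    """
    selected = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Evaluate every one-feature extension of the current subset.
        top_score, top_feature = max(
            (score(selected + [f]), f) for f in candidates
        )
        if top_score <= best_score:
            break  # no extension helps, so stop early
        selected.append(top_feature)
        best_score = top_score
    return selected, best_score


# Toy scoring function: per-feature usefulness minus a size penalty.
weights = [0.1, 0.5, 0.3, 0.0]

def toy_score(subset):
    return sum(weights[f] for f in subset) - 0.15 * len(subset)

chosen, chosen_score = sequential_forward_selection(4, toy_score, 4)
```

With these toy weights the loop picks feature 1, then feature 2, and stops because a third feature would cost more than it adds. The floating variant adds a backward step after each inclusion, which is omitted here.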

Relevance: 80.00%

Abstract:

In population studies, most current methods focus on identifying one outcome-related SNP at a time by testing for differences in genotype frequencies between disease and healthy groups or among different population groups. However, testing a great number of SNPs simultaneously raises a multiple-testing problem and will give false-positive results. Although this problem can be dealt with effectively through several approaches such as Bonferroni correction, permutation testing and false discovery rates, patterns of joint effects of several genes, each with a weak effect, might not be detectable. With the availability of high-throughput genotyping technology, searching for multiple scattered SNPs over the whole genome and modeling their joint effect on the target variable has become possible. Exhaustive search of all SNP subsets is computationally infeasible for millions of SNPs in a genome-wide study. Several effective feature selection methods combined with classification functions have been proposed to search for an optimal SNP subset in large data sets where the number of feature SNPs far exceeds the number of observations. In this study, we take two steps to achieve this goal. First, we selected 1000 SNPs through an effective filter method, and then we performed feature selection wrapped around a classifier to identify an optimal SNP subset for predicting disease. We also developed a novel classification method, the sequential information bottleneck method, wrapped inside different search algorithms to identify an optimal subset of SNPs for classifying the outcome variable. This new method was compared with classical linear discriminant analysis in terms of classification performance. Finally, we performed chi-square tests to look at the relationship between each SNP and disease from another point of view. 
In general, our results show that filtering features using the harmonic mean of sensitivity and specificity (HMSS) through linear discriminant analysis (LDA) is better than using LDA training accuracy or mutual information in our study. Our results also demonstrate that exhaustive search of small subsets of one, two or three SNPs, based on the best 100 composite 2-SNPs, can find an optimal subset, and that further inclusion of more SNPs through a heuristic algorithm does not always increase the performance of SNP subsets. Although sequential forward floating selection can be applied to prevent the nesting effect of forward selection, it does not always outperform the latter, due to overfitting from observing more complex subset states. Our results also indicate that HMSS, as a criterion for evaluating the classification ability of a function, can be used on imbalanced data without modifying the original dataset, unlike classification accuracy. Our four studies suggest that Sequential Information Bottleneck (sIB), a new unsupervised technique, can be adopted to predict the outcome and that its ability to detect the target status is superior to that of traditional LDA in the study. From our results we can see that the best test probability-HMSS for predicting CVD, stroke, CAD and psoriasis through sIB is 0.59406, 0.641815, 0.645315 and 0.678658, respectively. In terms of group prediction accuracy, the highest test accuracy of sIB for diagnosing a normal status among controls can reach 0.708999, 0.863216, 0.639918 and 0.850275 respectively in the four studies if the test accuracy among cases is required to be not less than 0.4. On the other hand, the highest test accuracy of sIB for diagnosing a disease among cases can reach 0.748644, 0.789916, 0.705701 and 0.749436 respectively in the four studies if the test accuracy among controls is required to be at least 0.4. 
A further genome-wide association study using the chi-square test shows that no significant SNPs are detected at the cut-off level 9.09451E-08 in the Framingham heart study of CVD. Study results in WTCCC can only detect two significant SNPs that are associated with CAD. In the genome-wide study of psoriasis, most of the top 20 SNP markers with impressive classification accuracy are also significantly associated with the disease through the chi-square test at the cut-off value 1.11E-07. Although our classification methods can achieve high accuracy in the study, complete descriptions of those classification results (95% confidence interval or statistical test of differences) require more cost-effective methods or a more efficient computing system, neither of which is currently available for our genome-wide study. We should also note that the purpose of this study is to identify subsets of SNPs with high prediction ability, and that SNPs with good discriminant power are not necessarily causal markers for the disease.
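The HMSS criterion used throughout this abstract is the harmonic mean of sensitivity and specificity. The short sketch below, with hypothetical confusion-matrix counts, shows why it behaves differently from plain accuracy on imbalanced data; the function names and the convention of returning 0 for an empty class are assumptions.

```python
def hmss(tp, fn, tn, fp):
    """Harmonic mean of sensitivity and specificity from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if sensitivity + specificity == 0.0:
        return 0.0
    return 2.0 * sensitivity * specificity / (sensitivity + specificity)


def accuracy(tp, fn, tn, fp):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical imbalanced data: 10 cases, 90 controls.
# A classifier that labels everyone a control looks accurate but has HMSS 0.
trivial_acc = accuracy(tp=0, fn=10, tn=90, fp=0)
trivial_hmss = hmss(tp=0, fn=10, tn=90, fp=0)

# A classifier that finds 8 of 10 cases at the cost of 45 false alarms.
useful_hmss = hmss(tp=8, fn=2, tn=45, fp=45)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a classifier cannot score well on HMSS by ignoring the minority class.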

Relevance: 80.00%

Abstract:

The main purpose of a gene interaction network is to map the relationships between genes that are out of sight when a genomic study is tackled. DNA microarrays allow the expression of thousands of genes to be measured at the same time. These data constitute the numeric seed for the induction of gene networks. In this paper, we propose a new approach to build gene networks by means of Bayesian classifiers, variable selection and bootstrap resampling. The interactions induced by the Bayesian classifiers are based both on the expression levels and on the phenotype information of the supervised variable. Feature selection and bootstrap resampling add reliability and robustness to the overall process by removing false positive findings. The consensus among all the induced models produces a hierarchy of dependences and, thus, of variables. Biologists can define the depth level of the model hierarchy, so the set of interactions and genes involved can vary from sparse to dense. Experimental results show that these networks perform well on classification tasks. The biological validation matches previous biological findings and opens new hypotheses for future studies.

Relevance: 80.00%

Abstract:

Many diseases have a genetic origin, and a great effort is being made to detect the genes that are responsible for their onset. One of the most promising techniques is the analysis of genetic information through the use of complex network theory. Yet a practical problem of this approach is its computational cost, which scales as the square of the number of features included in the initial dataset. In this paper, we propose the use of an iterative feature selection strategy to identify reduced subsets of relevant features, and show an application to the analysis of congenital Obstructive Nephropathy. Results demonstrate that, besides achieving a drastic reduction of the computational cost, the topologies of the obtained networks still hold all the relevant information, and are thus able to fully characterize the severity of the disease.

Relevance: 80.00%

Abstract:

In this paper, we analyze the performance of several well-known pattern recognition and dimensionality reduction techniques when applied to mass-spectrometry data for odor biometric identification. Motivated by the successful results of previous works capturing the odor from other parts of the body, this work attempts to evaluate the feasibility of identifying people by the odor emanating from the hands. Formulated as a machine learning task, the problem becomes a small-sample-size supervised classification problem in which the input data consist of mass spectrograms of the hand odor of 13 subjects captured in different sessions. The high dimensionality of the data makes it necessary to apply feature selection and extraction techniques together with a simple classifier in order to improve the generalization capabilities of the model. Our experiments achieve recognition rates over 85%, which reveals that hand odor contains discriminatory information and points to body odor as a promising biometric identifier.

Relevance: 80.00%

Abstract:

Nonlinear analysis tools for studying and characterizing the dynamics of physiological signals have gained popularity, mainly because tracking sudden alterations of the inherent complexity of biological processes might be an indicator of altered physiological states. Typically, in order to perform an analysis with such tools, the physiological variables that describe the biological process under study are used to reconstruct the underlying dynamics of the process. For that goal, a procedure called time-delay or uniform embedding is usually employed. Nonetheless, there is evidence of its inability to deal with non-stationary signals, such as those recorded from many physiological processes. To address this drawback, this paper evaluates the utility of non-conventional time series reconstruction procedures based on non-uniform embedding, applying them to automatic pattern recognition tasks. The paper compares a state-of-the-art non-uniform approach with a novel scheme that fuses embedding and feature selection in a single step, searching for better reconstructions of the dynamics of the system. Moreover, results are also compared with two classic uniform embedding techniques. Thus, the goal is to compare uniform and non-uniform reconstruction techniques, including the one proposed in this work, for pattern recognition in biomedical signal processing tasks. Once the state space is reconstructed, it is characterized with three classic nonlinear dynamic features (Largest Lyapunov Exponent, Correlation Dimension and Recurrence Period Density Entropy), while classification is carried out by means of a simple k-nn classifier. In order to test its generalization capabilities, the approach was tested with three different physiological databases (Speech Pathologies, Epilepsy and Heart Murmurs). 
In terms of the accuracy obtained in automatically detecting the presence of pathologies, and for the three types of biosignals analyzed, the non-uniform techniques used in this work slightly outperformed the uniform methods, suggesting their usefulness for characterizing non-stationary biomedical signals in pattern recognition applications. In addition, given the results obtained and its low computational load, the proposed technique appears applicable to the applications under study.
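The difference between uniform and non-uniform embedding discussed in this abstract can be made concrete with a short sketch. Uniform (time-delay) embedding builds state vectors from samples spaced by one fixed delay, while non-uniform embedding uses an arbitrary set of lags. The function names, the forward-lag convention and the example lags are assumptions. How the paper selects its lags is not described in the abstract and is not reproduced here.

```python
def delay_embedding(series, dimension, delay):
    """Uniform embedding: vectors (x[t], x[t+delay], ..., x[t+(dimension-1)*delay])."""
    n_vectors = len(series) - (dimension - 1) * delay
    if n_vectors <= 0:
        raise ValueError("series too short for this dimension and delay")
    return [tuple(series[t + k * delay] for k in range(dimension))
            for t in range(n_vectors)]


def nonuniform_embedding(series, lags):
    """Non-uniform embedding: vectors (x[t+lag] for each lag), lags >= 0."""
    n_vectors = len(series) - max(lags)
    if n_vectors <= 0:
        raise ValueError("series too short for these lags")
    return [tuple(series[t + lag] for lag in lags)
            for t in range(n_vectors)]


signal = [0, 1, 2, 3, 4, 5]
uniform = delay_embedding(signal, dimension=3, delay=2)
nonuniform = nonuniform_embedding(signal, lags=[0, 1, 4])
```

Uniform embedding is the special case of `nonuniform_embedding` with lags `[0, delay, 2 * delay, ...]`, which is why choosing the lags can be treated as a feature selection problem.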

Relevance: 80.00%

Abstract:

This paper presents a preliminary study in which Machine Learning experiments applied to Opinion Mining in blogs have been carried out. We created and annotated a blog corpus in Spanish using EmotiBlog. We evaluated the utility of the labelled features, first by carrying out experiments with combinations of them and second by using feature selection techniques. We also deal with several problems, such as the noisy character of the input texts, the small size of the training set, the granularity of the annotation scheme and the language of our study, Spanish, which has fewer resources than English. We obtained promising results considering that it is a preliminary study.

Relevance: 80.00%

Abstract:

Hypertrophic cardiomyopathy (HCM) is a cardiovascular disease where the heart muscle is partially thickened and blood flow is - potentially fatally - obstructed. It is one of the leading causes of sudden cardiac death in young people. Electrocardiography (ECG) and Echocardiography (Echo) are the standard tests for identifying HCM and other cardiac abnormalities. The American Heart Association has recommended using a pre-participation questionnaire for young athletes instead of ECG or Echo tests due to considerations of cost and time involved in interpreting the results of these tests by an expert cardiologist. Initially we set out to develop a classifier for automated prediction of young athletes’ heart conditions based on the answers to the questionnaire. Classification results and further in-depth analysis using computational and statistical methods indicated significant shortcomings of the questionnaire in predicting cardiac abnormalities. Automated methods for analyzing ECG signals can help reduce cost and save time in the pre-participation screening process by detecting HCM and other cardiac abnormalities. Therefore, the main goal of this dissertation work is to identify HCM through computational analysis of 12-lead ECG. ECG signals recorded on one or two leads have been analyzed in the past for classifying individual heartbeats into different types of arrhythmia as annotated primarily in the MIT-BIH database. In contrast, we classify complete sequences of 12-lead ECGs to assign patients into two groups: HCM vs. non-HCM. The challenges and issues we address include missing ECG waves in one or more leads and the dimensionality of a large feature-set. We address these by proposing imputation and feature-selection methods. We develop heartbeat-classifiers by employing Random Forests and Support Vector Machines, and propose a method to classify full 12-lead ECGs based on the proportion of heartbeats classified as HCM. 
The results from our experiments show that the classifiers developed using our methods perform well in identifying HCM. Thus the two contributions of this thesis are the utilization of computational and statistical methods for discovering shortcomings in a current screening procedure and the development of methods to identify HCM through computational analysis of 12-lead ECG signals.
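The final step described in this abstract, labelling a full 12-lead ECG from the proportion of its heartbeats classified as HCM, can be sketched as below. The per-beat labels would come from the heartbeat classifiers; here they are hypothetical inputs. The 0.5 threshold is an assumption made for illustration, since the abstract does not state the cut-off used.

```python
def classify_recording(beat_labels, threshold=0.5):
    """Label a recording from its per-heartbeat predictions.

    Returns the patient-level label and the proportion of beats labelled "HCM".
    The threshold is a hypothetical default, not a value from the source.
    """
    if not beat_labels:
        raise ValueError("recording has no classified heartbeats")
    proportion = sum(1 for b in beat_labels if b == "HCM") / len(beat_labels)
    label = "HCM" if proportion >= threshold else "non-HCM"
    return label, proportion


# Hypothetical per-beat outputs for one patient.
label, proportion = classify_recording(["HCM", "HCM", "non-HCM", "HCM"])
```

Aggregating over beats makes the patient-level decision tolerant of a few misclassified heartbeats, at the cost of one extra parameter to tune.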

Relevance: 80.00%

Abstract:

Thesis (Ph.D.)--University of Washington, 2016-06

Relevance: 80.00%

Abstract:

Document classification is a supervised machine learning process in which predefined category labels are assigned to documents based on a hypothesis derived from a training set of labelled documents. Documents cannot be directly interpreted by a computer system unless they have been modelled as a collection of computable features. Rogati and Yang [M. Rogati and Y. Yang, Resource selection for domain-specific cross-lingual IR, in SIGIR 2004: Proceedings of the 27th annual international conference on Research and Development in Information Retrieval, ACM Press, Sheffield: United Kingdom, pp. 154-161.] pointed out that the effectiveness of a document classification system may vary across domains. This implies that the quality of the document model contributes to the effectiveness of document classification. Conventionally, model evaluation is accomplished by comparing the effectiveness scores of classifiers on model candidates. However, this kind of evaluation method may encounter either under-fitting or over-fitting problems, because the effectiveness scores are restricted by the learning capacities of the classifiers. We propose a model fitness evaluation method to determine whether a model is sufficient to distinguish positive and negative instances while still providing satisfactory effectiveness with a small feature subset. Our experiments demonstrate how the fitness of models is assessed. The results of our work contribute to research on feature selection, dimensionality reduction and document classification.

Relevance: 80.00%

Abstract:

In this paper we explore the use of text-mining methods for identifying the author of a text. We apply the support vector machine (SVM) to this problem: as it is able to cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text. We performed a number of experiments with texts from a German newspaper. With nearly perfect reliability the SVM was able to reject other authors, and it detected the target author in 60–80% of the cases. In a second experiment, we ignored nouns, verbs and adjectives and replaced them with grammatical tags and bigrams. This resulted in slightly reduced performance. Author detection with SVMs on full word forms was remarkably robust even if the author wrote about different topics.