927 resultados para Customer feature selection


Relevância:

80.00% 80.00%

Publicador:

Resumo:

A organização automática de mensagens de correio electrónico é um desafio actual na área da aprendizagem automática. O número excessivo de mensagens afecta cada vez mais utilizadores, especialmente os que usam o correio electrónico como ferramenta de comunicação e trabalho. Esta tese aborda o problema da organização automática de mensagens de correio electrónico propondo uma solução que tem como objectivo a etiquetagem automática de mensagens. A etiquetagem automática é feita com recurso às pastas de correio electrónico anteriormente criadas pelos utilizadores, tratando-as como etiquetas, e à sugestão de múltiplas etiquetas para cada mensagem (top-N). São estudadas várias técnicas de aprendizagem e os vários campos que compõe uma mensagem de correio electrónico são analisados de forma a determinar a sua adequação como elementos de classificação. O foco deste trabalho recai sobre os campos textuais (o assunto e o corpo das mensagens), estudando-se diferentes formas de representação, selecção de características e algoritmos de classificação. É ainda efectuada a avaliação dos campos de participantes através de algoritmos de classificação que os representam usando o modelo vectorial ou como um grafo. Os vários campos são combinados para classificação utilizando a técnica de combinação de classificadores Votação por Maioria. Os testes são efectuados com um subconjunto de mensagens de correio electrónico da Enron e um conjunto de dados privados disponibilizados pelo Institute for Systems and Technologies of Information, Control and Communication (INSTICC). Estes conjuntos são analisados de forma a perceber as características dos dados. A avaliação do sistema é realizada através da percentagem de acerto dos classificadores. Os resultados obtidos apresentam melhorias significativas em comparação com os trabalhos relacionados.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Dissertação para obtenção do grau de Mestre em Engenharia Informática

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Trabalho de Projeto para obtenção do grau de Mestre em Engenharia Informática e de Computadores

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In data clustering, the problem of selecting the subset of most relevant features from the data has been an active research topic. Feature selection for clustering is a challenging task due to the absence of class labels for guiding the search for relevant features. Most methods proposed for this goal are focused on numerical data. In this work, we propose an approach for clustering and selecting categorical features simultaneously. We assume that the data originate from a finite mixture of multinomial distributions and implement an integrated expectation-maximization (EM) algorithm that estimates all the parameters of the model and selects the subset of relevant features simultaneously. The results obtained on synthetic data illustrate the performance of the proposed approach. An application to real data, referred to official statistics, shows its usefulness.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Locomotor tasks characterization plays an important role in trying to improve the quality of life of a growing elderly population. This paper focuses on this matter by trying to characterize the locomotion of two population groups with different functional fitness levels (high or low) while executing three different tasks-gait, stair ascent and stair descent. Features were extracted from gait data, and feature selection methods were used in order to get the set of features that allow differentiation between functional fitness level. Unsupervised learning was used to validate the sets obtained and, ultimately, indicated that it is possible to distinguish the two population groups. The sets of best discriminate features for each task are identified and thoroughly analysed. Copyright © 2014 SCITEPRESS - Science and Technology Publications. All rights reserved.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In the field of appearance-based robot localization, the mainstream approach uses a quantized representation of local image features. An alternative strategy is the exploitation of raw feature descriptors, thus avoiding approximations due to quantization. In this work, the quantized and non-quantized representations are compared with respect to their discriminativity, in the context of the robot global localization problem. Having demonstrated the advantages of the non-quantized representation, the paper proposes mechanisms to reduce the computational burden this approach would carry, when applied in its simplest form. This reduction is achieved through a hierarchical strategy which gradually discards candidate locations and by exploring two simplifying assumptions about the training data. The potential of the non-quantized representation is exploited by resorting to the entropy-discriminativity relation. The idea behind this approach is that the non-quantized representation facilitates the assessment of the distinctiveness of features, through the entropy measure. Building on this finding, the robustness of the localization system is enhanced by modulating the importance of features according to the entropy measure. Experimental results support the effectiveness of this approach, as well as the validity of the proposed computation reduction methods.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Dissertation to Obtain Master Degree in Biomedical Engineering

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Human Activity Recognition systems require objective and reliable methods that can be used in the daily routine and must offer consistent results according with the performed activities. These systems are under development and offer objective and personalized support for several applications such as the healthcare area. This thesis aims to create a framework for human activities recognition based on accelerometry signals. Some new features and techniques inspired in the audio recognition methodology are introduced in this work, namely Log Scale Power Bandwidth and the Markov Models application. The Forward Feature Selection was adopted as the feature selection algorithm in order to improve the clustering performances and limit the computational demands. This method selects the most suitable set of features for activities recognition in accelerometry from a 423th dimensional feature vector. Several Machine Learning algorithms were applied to the used accelerometry databases – FCHA and PAMAP databases - and these showed promising results in activities recognition. The developed algorithm set constitutes a mighty contribution for the development of reliable evaluation methods of movement disorders for diagnosis and treatment applications.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Botnets are a group of computers infected with a specific sub-set of a malware family and controlled by one individual, called botmaster. This kind of networks are used not only, but also for virtual extorsion, spam campaigns and identity theft. They implement different types of evasion techniques that make it harder for one to group and detect botnet traffic. This thesis introduces one methodology, called CONDENSER, that outputs clusters through a self-organizing map and that identify domain names generated by an unknown pseudo-random seed that is known by the botnet herder(s). Aditionally DNS Crawler is proposed, this system saves historic DNS data for fast-flux and double fastflux detection, and is used to identify live C&Cs IPs used by real botnets. A program, called CHEWER, was developed to automate the calculation of the SVM parameters and features that better perform against the available domain names associated with DGAs. CONDENSER and DNS Crawler were developed with scalability in mind so the detection of fast-flux and double fast-flux networks become faster. We used a SVM for the DGA classififer, selecting a total of 11 attributes and achieving a Precision of 77,9% and a F-Measure of 83,2%. The feature selection method identified the 3 most significant attributes of the total set of attributes. For clustering, a Self-Organizing Map was used on a total of 81 attributes. The conclusions of this thesis were accepted in Botconf through a submited article. Botconf is known conferênce for research, mitigation and discovery of botnets tailled for the industry, where is presented current work and research. This conference is known for having security and anti-virus companies, law enforcement agencies and researchers.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Dissertação de mestrado integrado em Engenharia Biomédica (área de especialização em Eletrónica Médica)

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Dissertação de mestrado integrado em Engenharia Biomédica (área de especialização em Eletrónica Médica)

Relevância:

80.00% 80.00%

Publicador:

Resumo:

The chemical composition of propolis is affected by environmental factors and harvest season, making it difficult to standardize its extracts for medicinal usage. By detecting a typical chemical profile associated with propolis from a specific production region or season, certain types of propolis may be used to obtain a specific pharmacological activity. In this study, propolis from three agroecological regions (plain, plateau, and highlands) from southern Brazil, collected over the four seasons of 2010, were investigated through a novel NMR-based metabolomics data analysis workflow. Chemometrics and machine learning algorithms (PLS-DA and RF), including methods to estimate variable importance in classification, were used in this study. The machine learning and feature selection methods permitted construction of models for propolis sample classification with high accuracy (>75%, reaching 90% in the best case), better discriminating samples regarding their collection seasons comparatively to the harvest regions. PLS-DA and RF allowed the identification of biomarkers for sample discrimination, expanding the set of discriminating features and adding relevant information for the identification of the class-determining metabolites. The NMR-based metabolomics analytical platform, coupled to bioinformatic tools, allowed characterization and classification of Brazilian propolis samples regarding the metabolite signature of important compounds, i.e., chemical fingerprint, harvest seasons, and production regions.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Recently, there has been a growing interest in the field of metabolomics, materialized by a remarkable growth in experimental techniques, available data and related biological applications. Indeed, techniques as Nuclear Magnetic Resonance, Gas or Liquid Chromatography, Mass Spectrometry, Infrared and UV-visible spectroscopies have provided extensive datasets that can help in tasks as biological and biomedical discovery, biotechnology and drug development. However, as it happens with other omics data, the analysis of metabolomics datasets provides multiple challenges, both in terms of methodologies and in the development of appropriate computational tools. Indeed, from the available software tools, none addresses the multiplicity of existing techniques and data analysis tasks. In this work, we make available a novel R package, named specmine, which provides a set of methods for metabolomics data analysis, including data loading in different formats, pre-processing, metabolite identification, univariate and multivariate data analysis, machine learning, and feature selection. Importantly, the implemented methods provide adequate support for the analysis of data from diverse experimental techniques, integrating a large set of functions from several R packages in a powerful, yet simple to use environment. The package, already available in CRAN, is accompanied by a web site where users can deposit datasets, scripts and analysis reports to be shared with the community, promoting the efficient sharing of metabolomics data analysis pipelines.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Difficult tracheal intubation assessment is an important research topic in anesthesia as failed intubations are important causes of mortality in anesthetic practice. The modified Mallampati score is widely used, alone or in conjunction with other criteria, to predict the difficulty of intubation. This work presents an automatic method to assess the modified Mallampati score from an image of a patient with the mouth wide open. For this purpose we propose an active appearance models (AAM) based method and use linear support vector machines (SVM) to select a subset of relevant features obtained using the AAM. This feature selection step proves to be essential as it improves drastically the performance of classification, which is obtained using SVM with RBF kernel and majority voting. We test our method on images of 100 patients undergoing elective surgery and achieve 97.9% accuracy in the leave-one-out crossvalidation test and provide a key element to an automatic difficult intubation assessment system.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Introduction: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. Methods: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. Results: A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. Conclusions: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.