22 results for machine learning, decision tree, concept drift, ensemble learning, classification, random forest


Relevance:

100.00% 100.00%

Publisher:

Abstract:

Background: Allergy is a form of hypersensitivity to normally innocuous substances, such as dust, pollen, foods or drugs. Allergens are small antigens that commonly provoke an IgE antibody response. There are two types of bioinformatics-based allergen prediction. The first approach follows FAO/WHO Codex Alimentarius guidelines and searches for sequence similarity. The second approach is based on identifying conserved allergenicity-related linear motifs. Both approaches assume that allergenicity is a linearly coded property. In the present study, we applied ACC pre-processing to sets of known allergens, developing alignment-independent models for allergen recognition based on the main chemical properties of amino acid sequences. Results: A set of 684 food, 1,156 inhalant and 555 toxin allergens was collected from several databases. A set of non-allergens from the same species was selected to mirror the allergen set. The amino acids in the protein sequences were described by three z-descriptors (z1, z2 and z3) and converted into uniform vectors by auto- and cross-covariance (ACC) transformation. Each protein was represented as a vector of 45 variables. Five machine learning methods for classification were applied in the study to derive models for allergen prediction. The methods were: discriminant analysis by partial least squares (DA-PLS), logistic regression (LR), decision tree (DT), naïve Bayes (NB) and k-nearest neighbours (kNN). The best performing model was derived by kNN at k = 3. It was optimized, cross-validated and implemented in a server named AllerTOP, freely accessible at http://www.pharmfac.net/allertop. AllerTOP also predicts the most probable route of exposure. AllerTOP outperforms other servers for allergen prediction, with 94% sensitivity. Conclusions: AllerTOP is the first alignment-free server for in silico prediction of allergens based on the main physicochemical properties of proteins. Significantly, as well as allergenicity, AllerTOP is able to predict the route of allergen exposure: food, inhalant or toxin. © 2013 Dimitrov et al.; licensee BioMed Central Ltd.
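
The ACC step described in this abstract can be illustrated with a minimal sketch. It assumes a maximum lag of 5 (not stated in the abstract, chosen here because 3 descriptors x 3 descriptors x 5 lags gives the 45 variables mentioned) and uses the plain average of lagged products. The descriptor table below is filled with random stand-in numbers, not the published z-scale values, and the names `acc_transform`, `encode` and `Z_TABLE` are hypothetical.

```python
import numpy as np

def acc_transform(z, max_lag):
    """Auto- and cross-covariance transform of a descriptor matrix.

    z has shape (n_residues, n_descriptors); the result is a fixed-length
    vector of n_descriptors**2 * max_lag values, whatever the sequence length.
    """
    z = np.asarray(z, dtype=float)
    n, _ = z.shape
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    features = []
    for lag in range(1, max_lag + 1):
        # Entry [j, k] averages z[i, j] * z[i + lag, k] over all positions i
        cov = z[:-lag].T @ z[lag:] / (n - lag)
        features.append(cov.ravel())
    return np.concatenate(features)

# Hypothetical descriptor table: random stand-ins, NOT the real z1-z3 values
rng = np.random.default_rng(0)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
Z_TABLE = {aa: rng.normal(size=3) for aa in AMINO_ACIDS}

def encode(sequence, max_lag=5):
    """Map a protein sequence to a uniform-length ACC vector."""
    z = np.array([Z_TABLE[aa] for aa in sequence])
    return acc_transform(z, max_lag)

vector = encode("MKTAYIAKQRQISFVKSHFSRQ")
print(vector.shape)  # (45,)
```

Because every protein maps to a vector of the same length, such vectors could be passed directly to a distance-based classifier such as kNN, with no sequence alignment involved.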

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In product reviews, it is observed that the distribution of polarity ratings over reviews written by different users or evaluated based on different products is often skewed in the real world. As such, incorporating user and product information would be helpful for the task of sentiment classification of reviews. However, existing approaches ignore the temporal nature of reviews posted by the same user or evaluated on the same product. We argue that the temporal relations of reviews might be useful for learning user and product embeddings, and thus propose employing a sequence model to embed these temporal relations into user and product representations so as to improve the performance of document-level sentiment analysis. Specifically, we first learn a distributed representation of each review by a one-dimensional convolutional neural network. Then, taking these representations as pretrained vectors, we use a recurrent neural network with gated recurrent units to learn distributed representations of users and products. Finally, we feed the user, product and review representations into a machine learning classifier for sentiment classification. Our approach has been evaluated on three large-scale review datasets from IMDB and Yelp. Experimental results show that: (1) sequence modeling for the purposes of distributed user and product representation learning can improve the performance of document-level sentiment classification; (2) the proposed approach achieves state-of-the-art results on these benchmark datasets.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This thesis presents an investigation of synchronisation and causality, motivated by problems in computational neuroscience. The thesis addresses both theoretical and practical signal processing issues regarding the estimation of interdependence from a set of multivariate data generated by a complex underlying dynamical system. This topic is driven by a series of problems in neuroscience, which represents the principal background motive behind the material in this work. The underlying system is the human brain and the generative process of the data is based on modern electromagnetic neuroimaging methods. In this thesis, the underlying functional mechanisms of the brain are derived from the recent mathematical formalism of dynamical systems in complex networks. This is justified principally on the grounds of the complex hierarchical and multiscale nature of the brain, and it offers new methods of analysis to model its emergent phenomena. A fundamental approach to studying neural activity is to investigate the connectivity pattern developed by the brain’s complex network. Three types of connectivity are important to study: 1) anatomical connectivity, referring to the physical links forming the topology of the brain network; 2) effective connectivity, concerned with the way the neural elements communicate with each other using the brain’s anatomical structure, through phenomena of synchronisation and information transfer; 3) functional connectivity, an epistemic concept which alludes to the interdependence between data measured from the brain network. The main contribution of this thesis is to present, apply and discuss novel algorithms of functional connectivity, which are designed to extract different specific aspects of interaction between the underlying generators of the data. Firstly, a univariate statistic is developed to allow for indirect assessment of synchronisation in the local network from a single time series.
This approach is useful in inferring the coupling in a local cortical area as observed by a single measurement electrode. Secondly, different existing methods of phase synchronisation are considered from the perspective of experimental data analysis and inference of coupling from observed data. These methods are designed to address the estimation of medium to long range connectivity, and their differences are particularly relevant in the context of volume conduction, which is known to produce spurious detections of connectivity. Finally, an asymmetric temporal metric is introduced in order to detect the direction of the coupling between different regions of the brain. The method developed in this thesis is based on a machine learning extension of the well-known concept of Granger causality. The thesis discussion is developed alongside examples of synthetic and experimental real data. The synthetic data are simulations of complex dynamical systems intended to mimic the behaviour of simple cortical neural assemblies. They are helpful for testing the techniques developed in this thesis. The real datasets are provided to illustrate the problem of brain connectivity in the case of important neurological disorders such as epilepsy and Parkinson’s disease. The methods of functional connectivity in this thesis are applied to intracranial EEG recordings in order to extract features which characterize the underlying spatiotemporal dynamics before, during and after an epileptic seizure, and to predict seizure location and onset prior to conventional electrographic signs. The methodology is also applied to a MEG dataset containing healthy, Parkinson’s and dementia subjects, with the aim of distinguishing patterns of pathological from physiological connectivity.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Stochastic differential equations arise naturally in a range of contexts, from financial to environmental modeling. Current solution methods are limited in their representation of the posterior process in the presence of data. In this work, we present a novel Gaussian process approximation to the posterior measure over paths for a general class of stochastic differential equations in the presence of observations. The method is applied to two simple problems: the Ornstein-Uhlenbeck process, for which the exact solution is known and can serve as a reference, and the double-well system, for which standard approaches such as the ensemble Kalman smoother fail to provide a satisfactory result. Experiments show that our variational approximation is viable and that the results are very promising, as the variational approximate solution outperforms standard Gaussian process regression for non-Gaussian Markov processes.
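
For orientation, the two test problems named in this abstract can be simulated forward in time with a basic Euler-Maruyama scheme. This sketch only generates sample paths; it is not the variational Gaussian process approximation the abstract proposes. The drift functions, parameter values and the function name `euler_maruyama` are assumptions made for illustration.

```python
import numpy as np

def euler_maruyama(drift, sigma, x0, dt, n_steps, rng):
    """Simulate dx = drift(x) dt + sigma dW on a regular time grid."""
    path = np.empty(n_steps + 1)
    path[0] = x0
    # Brownian increments are Gaussian with mean 0 and variance dt
    increments = rng.normal(loc=0.0, scale=np.sqrt(dt), size=n_steps)
    for i in range(n_steps):
        path[i + 1] = path[i] + drift(path[i]) * dt + sigma * increments[i]
    return path

THETA = 1.0  # assumed mean-reversion rate for the Ornstein-Uhlenbeck example

def ou_drift(x):
    """Linear drift pulling the state back towards zero."""
    return -THETA * x

def double_well_drift(x):
    """Assumed double-well drift with stable points at x = -1 and x = +1."""
    return 4.0 * x * (1.0 - x ** 2)

rng = np.random.default_rng(42)
ou_path = euler_maruyama(ou_drift, sigma=0.5, x0=1.0, dt=0.01, n_steps=1000, rng=rng)
dw_path = euler_maruyama(double_well_drift, sigma=0.7, x0=1.0, dt=0.01, n_steps=1000, rng=rng)
print(ou_path.shape, dw_path.shape)  # (1001,) (1001,)
```

Noisy observations sampled from such paths would be the kind of data on which a posterior over paths is then inferred.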

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Electrocardiography (ECG) has recently been proposed as a biometric trait for identification purposes. Intra-individual variations of ECG might affect identification performance. These variations are mainly due to Heart Rate Variability (HRV). In particular, HRV causes changes in the QT intervals along the ECG waveforms. This work aims to analyse the influence of seven QT interval correction methods (based on population models) on the performance of ECG-fiducial-based identification systems. In addition, we have also considered the influence of training set size, classifier, classifier ensemble and the number of consecutive heartbeats in a majority voting scheme. The ECG signals used in this study were collected from thirty-nine subjects within the Physionet open access database. Public domain software was used for fiducial point detection. Results suggest that QT correction is indeed required to improve performance. However, there is no clear choice among the seven explored approaches for QT correction (identification rate between 0.97 and 0.99). MultiLayer Perceptron and Support Vector Machine seemed to have better generalization capabilities, in terms of classification performance, than Decision Tree-based classifiers. No strong influence of the training-set size or the number of consecutive heartbeats was observed on the majority voting scheme.
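
The majority voting over consecutive heartbeats mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming a per-beat classifier has already produced one subject label per heartbeat; the function names and the non-overlapping window layout are hypothetical, not taken from the study.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def identify(beat_predictions, window):
    """Fuse per-beat predictions over non-overlapping windows of beats."""
    last_start = len(beat_predictions) - window
    return [
        majority_vote(beat_predictions[start:start + window])
        for start in range(0, last_start + 1, window)
    ]

# Hypothetical per-beat outputs of a classifier for six consecutive heartbeats
beats = ["s1", "s1", "s2", "s1", "s3", "s3"]
print(identify(beats, window=3))  # ['s1', 's3']
```

A larger window averages out isolated misclassified beats at the cost of needing a longer recording before a decision is made.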

Relevance:

100.00% 100.00%

Publisher:

Abstract:

We propose a family of attributed graph kernels based on mutual information measures, i.e., the Jensen-Tsallis (JT) q-differences (for q ∈ [1,2]) between probability distributions over the graphs. To this end, we first assign a probability to each vertex of the graph through a continuous-time quantum walk (CTQW). We then adopt the tree-index approach [1] to strengthen the original vertex labels, and we show how the CTQW can induce a probability distribution over these strengthened labels. We show that our JT kernel (for q = 1) overcomes the shortcoming of discarding non-isomorphic substructures arising in the R-convolution kernels. Moreover, we prove that the proposed JT kernels generalize the Jensen-Shannon graph kernel [2] (for q = 1) and the classical subtree kernel [3] (for q = 2), respectively. Experimental evaluations demonstrate the effectiveness and efficiency of the JT kernels.
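
To make the quantity behind these kernels concrete, the sketch below computes a Jensen-Tsallis q-difference between two discrete distributions. It assumes the usual equal-weight definition, S_q((p + r)/2) - (S_q(p) + S_q(r)) / 2^q, with S_q the Tsallis entropy (Shannon entropy at q = 1). How the abstract's kernels are built from this quantity and from the quantum-walk distributions is not shown, and the function names are hypothetical.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p); reduces to Shannon entropy (nats) at q = 1."""
    p = np.asarray(p, dtype=float)
    if np.isclose(q, 1.0):
        nonzero = p[p > 0]  # treat 0 * log(0) as 0
        return float(-np.sum(nonzero * np.log(nonzero)))
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

def jt_q_difference(p, r, q):
    """Equal-weight Jensen-Tsallis q-difference between distributions p and r."""
    p = np.asarray(p, dtype=float)
    r = np.asarray(r, dtype=float)
    mixture = 0.5 * (p + r)
    weighted = (tsallis_entropy(p, q) + tsallis_entropy(r, q)) / 2.0 ** q
    return tsallis_entropy(mixture, q) - weighted

p = [1.0, 0.0]
r = [0.0, 1.0]
print(jt_q_difference(p, r, q=1.0))  # log(2), the Jensen-Shannon divergence
print(jt_q_difference(p, r, q=2.0))  # 0.5
```

At q = 1 the expression coincides with the Jensen-Shannon divergence, which is consistent with the abstract's statement that the q = 1 case recovers the Jensen-Shannon graph kernel.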

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Feature selection is important in the medical field for many reasons. However, selecting important variables is a difficult task in the presence of censoring, which is a unique feature of survival data analysis. This paper proposes an approach to deal with the censoring problem in endovascular aortic repair survival data through Bayesian networks. It was merged and embedded with a hybrid feature selection process that combines Cox's univariate analysis with machine learning approaches such as ensemble artificial neural networks to select the most relevant predictive variables. The proposed algorithm was compared with common survival variable selection approaches such as the least absolute shrinkage and selection operator (LASSO) and Akaike information criterion (AIC) methods. The results showed that it was capable of dealing with high censoring in the datasets. Moreover, ensemble classifiers increased the area under the ROC curves of the two datasets, which were collected separately from two centers located in the United Kingdom. Furthermore, ensembles constructed with center 1 enhanced the concordance index of center 2 prediction compared to the model built with a single network. Although the size of the final reduced model using the neural networks and their ensembles is greater than that of other methods, the model outperformed the others in both concordance index and sensitivity for center 2 prediction. This indicates that the reduced model is more powerful for cross-center prediction.
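
The concordance index used as an evaluation measure in this abstract can be illustrated with a short sketch of Harrell's C for right-censored data. The paper's exact variant is not specified, so this is an assumption; the example data and the function name are hypothetical. A pair of subjects is counted only when the one with the shorter follow-up time had an observed event, which is how censoring enters the calculation.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    times: follow-up times; events: 1 if the event was observed, 0 if censored;
    risk_scores: higher score means the model predicts an earlier event.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Usable only if subject i had an observed event before time j
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical data: the second subject is censored at time 4
times = [2, 4, 6, 8]
events = [1, 0, 1, 1]
risk = [0.9, 0.8, 0.3, 0.4]
print(concordance_index(times, events, risk))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so heavy censoring mainly reduces the number of comparable pairs available for the estimate.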