776 resultados para Machine learning methods


Relevância:

90.00% 90.00%

Publicador:

Resumo:

Solving many scientific problems requires effective regression and/or classification models for large high-dimensional datasets. Experts from these problem domains (e.g. biologists, chemists, financial analysts) have insights into the domain which can be helpful in developing powerful models but they need a modelling framework that helps them to use these insights. Data visualisation is an effective technique for presenting data and requiring feedback from the experts. A single global regression model can rarely capture the full behavioural variability of a huge multi-dimensional dataset. Instead, local regression models, each focused on a separate area of input space, often work better since the behaviour of different areas may vary. Classical local models such as Mixture of Experts segment the input space automatically, which is not always effective and it also lacks involvement of the domain experts to guide a meaningful segmentation of the input space. In this paper we addresses this issue by allowing domain experts to interactively segment the input space using data visualisation. The segmentation output obtained is then further used to develop effective local regression models.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The peculiarities of English language teaching for students at higher educational establishment using some elements of distance learning, developed by the author, are described in this article. The results of students’ questioning, received at the end of the experimental teaching, are suggested and analyzed. The conclusions are formulated and the further ways of teaching English with e-support are outlined.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This paper addresses the task of learning classifiers from streams of labelled data. In this case we can face the problem that the underlying concepts can change over time. The paper studies two mechanisms developed for dealing with changing concepts. Both are based on the time window idea. The first one forgets gradually, by assigning to the examples weight that gradually decreases over time. The second one uses a statistical test to detect changes in concept and then optimizes the size of the time window, aiming to maximise the classification accuracy on the new examples. Both methods are general in nature and can be used with any learning algorithm. The objectives of the conducted experiments were to compare the mechanisms and explore whether they can be combined to achieve a synergetic e ect. Results from experiments with three basic learning algorithms (kNN, ID3 and NBC) using four datasets are reported and discussed.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Formal grammars can used for describing complex repeatable structures such as DNA sequences. In this paper, we describe the structural composition of DNA sequences using a context-free stochastic L-grammar. L-grammars are a special class of parallel grammars that can model the growth of living organisms, e.g. plant development, and model the morphology of a variety of organisms. We believe that parallel grammars also can be used for modeling genetic mechanisms and sequences such as promoters. Promoters are short regulatory DNA sequences located upstream of a gene. Detection of promoters in DNA sequences is important for successful gene prediction. Promoters can be recognized by certain patterns that are conserved within a species, but there are many exceptions which makes the promoter recognition a complex problem. We replace the problem of promoter recognition by induction of context-free stochastic L-grammar rules, which are later used for the structural analysis of promoter sequences. L-grammar rules are derived automatically from the drosophila and vertebrate promoter datasets using a genetic programming technique and their fitness is evaluated using a Support Vector Machine (SVM) classifier. The artificial promoter sequences generated using the derived L- grammar rules are analyzed and compared with natural promoter sequences.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

In this paper is described a didactic methodology combining current e-learning methods and the support of Intelligent Agents technologies. The aim is to favor the synthesis among theoretical approach and based practical approach using the so-called Intelligent Agent, software that exploits the Artificial Intelligence and that operates as tutor, facilitating the consumers in the training operations. The paper illustrates how such new Intelligent Agent algorithm (IA) is used in the training of employees working in the transportation sector, thanks to the experience gained with the PARMENIDE project - Promoting Advanced Resources and Methodologies for New Teaching and Learning Solutions in Digital Education.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

A rough set approach for attribute reduction is an important research subject in data mining and machine learning. However, most attribute reduction methods are performed on a complete decision system table. In this paper, we propose methods for attribute reduction in static incomplete decision systems and dynamic incomplete decision systems with dynamically-increasing and decreasing conditional attributes. Our methods use generalized discernibility matrix and function in tolerance-based rough sets.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Background Lifelong surveillance after endovascular repair (EVAR) of abdominal aortic aneurysms (AAA) is considered mandatory to detect potentially life-threatening endograft complications. A minority of patients require reintervention but cannot be predictively identified by existing methods. This study aimed to improve the prediction of endograft complications and mortality, through the application of machine-learning techniques. Methods Patients undergoing EVAR at 2 centres were studied from 2004-2010. Pre-operative aneurysm morphology was quantified and endograft complications were recorded up to 5 years following surgery. An artificial neural networks (ANN) approach was used to predict whether patients would be at low- or high-risk of endograft complications (aortic/limb) or mortality. Centre 1 data were used for training and centre 2 data for validation. ANN performance was assessed by Kaplan-Meier analysis to compare the incidence of aortic complications, limb complications, and mortality; in patients predicted to be low-risk, versus those predicted to be high-risk. Results 761 patients aged 75 +/- 7 years underwent EVAR. Mean follow-up was 36+/- 20 months. An ANN was created from morphological features including angulation/length/areas/diameters/ volume/tortuosity of the aneurysm neck/sac/iliac segments. ANN models predicted endograft complications and mortality with excellent discrimination between a low-risk and high-risk group. In external validation, the 5-year rates of freedom from aortic complications, limb complications and mortality were 95.9% vs 67.9%; 99.3% vs 92.0%; and 87.9% vs 79.3% respectively (p0.001) Conclusion This study presents ANN models that stratify the 5-year risk of endograft complications or mortality using routinely available pre-operative data.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Objective: To test the practicality and effectiveness of cheap, ubiquitous, consumer-grade smartphones to discriminate Parkinson’s disease (PD) subjects from healthy controls, using self-administered tests of gait and postural sway. Background: Existing tests for the diagnosis of PD are based on subjective neurological examinations, performed in-clinic. Objective movement symptom severity data, collected using widely-accessible technologies such as smartphones, would enable the remote characterization of PD symptoms based on self-administered, behavioral tests. Smartphones, when backed up by interviews using web-based videoconferencing, could make it feasible for expert neurologists to perform diagnostic testing on large numbers of individuals at low cost. However, to date, the compliance rate of testing using smart-phones has not been assessed. Methods: We conducted a one-month controlled study with twenty participants, comprising 10 PD subjects and 10 controls. All participants were provided identical LG Optimus S smartphones, capable of recording tri-axial acceleration. Using these smartphones, patients conducted self-administered, short (less than 5 minute) controlled gait and postural sway tests. We analyzed a wide range of summary measures of gait and postural sway from the accelerometry data. Using statistical machine learning techniques, we identified discriminating patterns in the summary measures in order to distinguish PD subjects from controls. Results: Compliance was high all 20 participants performed an average of 3.1 tests per day for the duration of the study. Using this test data, we demonstrated cross-validated sensitivity of 98% and specificity of 98% in discriminating PD subjects from healthy controls. Conclusions: Using consumer-grade smartphone accelerometers, it is possible to distinguish PD from healthy controls with high accuracy. Since these smartphones are inexpensive (around $30 each) and easily available, and the tests are highly non-invasive and objective, we envisage that this kind of smartphone-based testing could radically increase the reach and effectiveness of experts in diagnosing PD.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Heterogeneous datasets arise naturally in most applications due to the use of a variety of sensors and measuring platforms. Such datasets can be heterogeneous in terms of the error characteristics and sensor models. Treating such data is most naturally accomplished using a Bayesian or model-based geostatistical approach; however, such methods generally scale rather badly with the size of dataset, and require computationally expensive Monte Carlo based inference. Recently within the machine learning and spatial statistics communities many papers have explored the potential of reduced rank representations of the covariance matrix, often referred to as projected or fixed rank approaches. In such methods the covariance function of the posterior process is represented by a reduced rank approximation which is chosen such that there is minimal information loss. In this paper a sequential Bayesian framework for inference in such projected processes is presented. The observations are considered one at a time which avoids the need for high dimensional integrals typically required in a Bayesian approach. A C++ library, gptk, which is part of the INTAMAP web service, is introduced which implements projected, sequential estimation and adds several novel features. In particular the library includes the ability to use a generic observation operator, or sensor model, to permit data fusion. It is also possible to cope with a range of observation error characteristics, including non-Gaussian observation errors. Inference for the covariance parameters is explored, including the impact of the projected process approximation on likelihood profiles. We illustrate the projected sequential method in application to synthetic and real datasets. Limitations and extensions are discussed. © 2010 Elsevier Ltd.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Background: Major Depressive Disorder (MDD) is among the most prevalent and disabling medical conditions worldwide. Identification of clinical and biological markers ("biomarkers") of treatment response could personalize clinical decisions and lead to better outcomes. This paper describes the aims, design, and methods of a discovery study of biomarkers in antidepressant treatment response, conducted by the Canadian Biomarker Integration Network in Depression (CAN-BIND). The CAN-BIND research program investigates and identifies biomarkers that help to predict outcomes in patients with MDD treated with antidepressant medication. The primary objective of this initial study (known as CAN-BIND-1) is to identify individual and integrated neuroimaging, electrophysiological, molecular, and clinical predictors of response to sequential antidepressant monotherapy and adjunctive therapy in MDD. Methods: CAN-BIND-1 is a multisite initiative involving 6 academic health centres working collaboratively with other universities and research centres. In the 16-week protocol, patients with MDD are treated with a first-line antidepressant (escitalopram 10-20 mg/d) that, if clinically warranted after eight weeks, is augmented with an evidence-based, add-on medication (aripiprazole 2-10 mg/d). Comprehensive datasets are obtained using clinical rating scales; behavioural, dimensional, and functioning/quality of life measures; neurocognitive testing; genomic, genetic, and proteomic profiling from blood samples; combined structural and functional magnetic resonance imaging; and electroencephalography. De-identified data from all sites are aggregated within a secure neuroinformatics platform for data integration, management, storage, and analyses. Statistical analyses will include multivariate and machine-learning techniques to identify predictors, moderators, and mediators of treatment response. Discussion: From June 2013 to February 2015, a cohort of 134 participants (85 outpatients with MDD and 49 healthy participants) has been evaluated at baseline. The clinical characteristics of this cohort are similar to other studies of MDD. Recruitment at all sites is ongoing to a target sample of 290 participants. CAN-BIND will identify biomarkers of treatment response in MDD through extensive clinical, molecular, and imaging assessments, in order to improve treatment practice and clinical outcomes. It will also create an innovative, robust platform and database for future research. Trial registration: ClinicalTrials.gov identifier NCT01655706. Registered July 27, 2012.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Feature selection is important in medical field for many reasons. However, selecting important variables is a difficult task with the presence of censoring that is a unique feature in survival data analysis. This paper proposed an approach to deal with the censoring problem in endovascular aortic repair survival data through Bayesian networks. It was merged and embedded with a hybrid feature selection process that combines cox's univariate analysis with machine learning approaches such as ensemble artificial neural networks to select the most relevant predictive variables. The proposed algorithm was compared with common survival variable selection approaches such as; least absolute shrinkage and selection operator LASSO, and Akaike information criterion AIC methods. The results showed that it was capable of dealing with high censoring in the datasets. Moreover, ensemble classifiers increased the area under the roc curves of the two datasets collected from two centers located in United Kingdom separately. Furthermore, ensembles constructed with center 1 enhanced the concordance index of center 2 prediction compared to the model built with a single network. Although the size of the final reduced model using the neural networks and its ensembles is greater than other methods, the model outperformed the others in both concordance index and sensitivity for center 2 prediction. This indicates the reduced model is more powerful for cross center prediction.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Motivation: In any macromolecular polyprotic system - for example protein, DNA or RNA - the isoelectric point - commonly referred to as the pI - can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge - and thus the electrophoretic mobility - of the ampholyte sums to zero. Different modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Protein separation by isoelectric point is a critical part of 2-D gel electrophoresis, a key precursor of proteomics, where discrete spots can be digested in-gel, and proteins subsequently identified by analytical mass spectrometry. Peptide fractionation according to their pI is also widely used in current proteomics sample preparation procedures previous to the LC-MS/MS analysis. Therefore accurate theoretical prediction of pI would expedite such analysis. While such pI calculation is widely used, it remains largely untested, motivating our efforts to benchmark pI prediction methods. Results: Using data from the database PIP-DB and one publically available dataset as our reference gold standard, we have undertaken the benchmarking of pI calculation methods. We find that methods vary in their accuracy and are highly sensitive to the choice of basis set. The machine-learning algorithms, especially the SVM-based algorithm, showed a superior performance when studying peptide mixtures. In general, learning-based pI prediction methods (such as Cofactor, SVM and Branca) require a large training dataset and their resulting performance will strongly depend of the quality of that data. In contrast with Iterative methods, machine-learning algorithms have the advantage of being able to add new features to improve the accuracy of prediction. Contact: yperez@ebi.ac.uk Availability and Implementation: The software and data are freely available at https://github.com/ypriverol/pIR. Supplementary information: Supplementary data are available at Bioinformatics online.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This thesis studies survival analysis techniques dealing with censoring to produce predictive tools that predict the risk of endovascular aortic aneurysm repair (EVAR) re-intervention. Censoring indicates that some patients do not continue follow up, so their outcome class is unknown. Methods dealing with censoring have drawbacks and cannot handle the high censoring of the two EVAR datasets collected. Therefore, this thesis presents a new solution to high censoring by modifying an approach that was incapable of differentiating between risks groups of aortic complications. Feature selection (FS) becomes complicated with censoring. Most survival FS methods depends on Cox's model, however machine learning classifiers (MLC) are preferred. Few methods adopted MLC to perform survival FS, but they cannot be used with high censoring. This thesis proposes two FS methods which use MLC to evaluate features. The two FS methods use the new solution to deal with censoring. They combine factor analysis with greedy stepwise FS search which allows eliminated features to enter the FS process. The first FS method searches for the best neural networks' configuration and subset of features. The second approach combines support vector machines, neural networks, and K nearest neighbor classifiers using simple and weighted majority voting to construct a multiple classifier system (MCS) for improving the performance of individual classifiers. It presents a new hybrid FS process by using MCS as a wrapper method and merging it with the iterated feature ranking filter method to further reduce the features. The proposed techniques outperformed FS methods based on Cox's model such as; Akaike and Bayesian information criteria, and least absolute shrinkage and selector operator in the log-rank test's p-values, sensitivity, and concordance. This proves that the proposed techniques are more powerful in correctly predicting the risk of re-intervention. Consequently, they enable doctors to set patients’ appropriate future observation plan.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

In product reviews, it is observed that the distribution of polarity ratings over reviews written by different users or evaluated based on different products are often skewed in the real world. As such, incorporating user and product information would be helpful for the task of sentiment classification of reviews. However, existing approaches ignored the temporal nature of reviews posted by the same user or evaluated on the same product. We argue that the temporal relations of reviews might be potentially useful for learning user and product embedding and thus propose employing a sequence model to embed these temporal relations into user and product representations so as to improve the performance of document-level sentiment analysis. Specifically, we first learn a distributed representation of each review by a one-dimensional convolutional neural network. Then, taking these representations as pretrained vectors, we use a recurrent neural network with gated recurrent units to learn distributed representations of users and products. Finally, we feed the user, product and review representations into a machine learning classifier for sentiment classification. Our approach has been evaluated on three large-scale review datasets from the IMDB and Yelp. Experimental results show that: (1) sequence modeling for the purposes of distributed user and product representation learning can improve the performance of document-level sentiment classification; (2) the proposed approach achieves state-of-The-Art results on these benchmark datasets.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The primary aim of this dissertation is to develop data mining tools for knowledge discovery in biomedical data when multiple (homogeneous or heterogeneous) sources of data are available. The central hypothesis is that, when information from multiple sources of data are used appropriately and effectively, knowledge discovery can be better achieved than what is possible from only a single source. ^ Recent advances in high-throughput technology have enabled biomedical researchers to generate large volumes of diverse types of data on a genome-wide scale. These data include DNA sequences, gene expression measurements, and much more; they provide the motivation for building analysis tools to elucidate the modular organization of the cell. The challenges include efficiently and accurately extracting information from the multiple data sources; representing the information effectively, developing analytical tools, and interpreting the results in the context of the domain. ^ The first part considers the application of feature-level integration to design classifiers that discriminate between soil types. The machine learning tools, SVM and KNN, were used to successfully distinguish between several soil samples. ^ The second part considers clustering using multiple heterogeneous data sources. The resulting Multi-Source Clustering (MSC) algorithm was shown to have a better performance than clustering methods that use only a single data source or a simple feature-level integration of heterogeneous data sources. ^ The third part proposes a new approach to effectively incorporate incomplete data into clustering analysis. Adapted from K-means algorithm, the Generalized Constrained Clustering (GCC) algorithm makes use of incomplete data in the form of constraints to perform exploratory analysis. Novel approaches for extracting constraints were proposed. For sufficiently large constraint sets, the GCC algorithm outperformed the MSC algorithm. ^ The last part considers the problem of providing a theme-specific environment for mining multi-source biomedical data. The database called PlasmoTFBM, focusing on gene regulation of Plasmodium falciparum, contains diverse information and has a simple interface to allow biologists to explore the data. It provided a framework for comparing different analytical tools for predicting regulatory elements and for designing useful data mining tools. ^ The conclusion is that the experiments reported in this dissertation strongly support the central hypothesis.^