813 resultados para microarray data classification


Relevância:

30.00% 30.00%

Publicador:

Resumo:

We present a Bayesian image classification scheme for discriminating cloud, clear and sea-ice observations at high latitudes to improve identification of areas of clear-sky over ice-free ocean for SST retrieval. We validate the image classification against a manually classified dataset using Advanced Along Track Scanning Radiometer (AATSR) data. A three way classification scheme using a near-infrared textural feature improves classifier accuracy by 9.9 % over the nadir only version of the cloud clearing used in the ATSR Reprocessing for Climate (ARC) project in high latitude regions. The three way classification gives similar numbers of cloud and ice scenes misclassified as clear but significantly more clear-sky cases are correctly identified (89.9 % compared with 65 % for ARC). We also demonstrate the poetential of a Bayesian image classifier including information from the 0.6 micron channel to be used in sea-ice extent and ice surface temperature retrieval with 77.7 % of ice scenes correctly identified and an overall classifier accuracy of 96 %.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Automatic generation of classification rules has been an increasingly popular technique in commercial applications such as Big Data analytics, rule based expert systems and decision making systems. However, a principal problem that arises with most methods for generation of classification rules is the overfit-ting of training data. When Big Data is dealt with, this may result in the generation of a large number of complex rules. This may not only increase computational cost but also lower the accuracy in predicting further unseen instances. This has led to the necessity of developing pruning methods for the simplification of rules. In addition, classification rules are used further to make predictions after the completion of their generation. As efficiency is concerned, it is expected to find the first rule that fires as soon as possible by searching through a rule set. Thus a suit-able structure is required to represent the rule set effectively. In this chapter, the authors introduce a unified framework for construction of rule based classification systems consisting of three operations on Big Data: rule generation, rule simplification and rule representation. The authors also review some existing methods and techniques used for each of the three operations and highlight their limitations. They introduce some novel methods and techniques developed by them recently. These methods and techniques are also discussed in comparison to existing ones with respect to efficient processing of Big Data.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The past years have shown an enormous advancement in sequencing and array-based technologies, producing supplementary or alternative views of the genome stored in various formats and databases. Their sheer volume and different data scope pose a challenge to jointly visualize and integrate diverse data types. We present AmalgamScope a new interactive software tool focusing on assisting scientists with the annotation of the human genome and particularly the integration of the annotation files from multiple data types, using gene identifiers and genomic coordinates. Supported platforms include next-generation sequencing and microarray technologies. The available features of AmalgamScope range from the annotation of diverse data types across the human genome to integration of the data based on the annotational information and visualization of the merged files within chromosomal regions or the whole genome. Additionally, users can define custom transcriptome library files for any species and use the file exchanging distant server options of the tool.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Three coupled knowledge transfer partnerships used pattern recognition techniques to produce an e-procurement system which, the National Audit Office reports, could save the National Health Service £500 m per annum. An extension to the system, GreenInsight, allows the environmental impact of procurements to be assessed and savings made. Both systems require suitable products to be discovered and equivalent products recognised, for which classification is a key component. This paper describes the innovative work done for product classification, feature selection and reducing the impact of mislabelled data.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The EU Water Framework Directive (WFD) requires that the ecological and chemical status of water bodies in Europe should be assessed, and action taken where possible to ensure that at least "good" quality is attained in each case by 2015. This paper is concerned with the accuracy and precision with which chemical status in rivers can be measured given certain sampling strategies, and how this can be improved. High-frequency (hourly) chemical data from four rivers in southern England were subsampled to simulate different sampling strategies for four parameters used for WFD classification: dissolved phosphorus, dissolved oxygen, pH and water temperature. These data sub-sets were then used to calculate the WFD classification for each site. Monthly sampling was less precise than weekly sampling, but the effect on WFD classification depended on the closeness of the range of concentrations to the class boundaries. In some cases, monthly sampling for a year could result in the same water body being assigned to three or four of the WFD classes with 95% confidence, due to random sampling effects, whereas with weekly sampling this was one or two classes for the same cases. In the most extreme case, the same water body could have been assigned to any of the five WFD quality classes. Weekly sampling considerably reduces the uncertainties compared to monthly sampling. The width of the weekly sampled confidence intervals was about 33% that of the monthly for P species and pH, about 50% for dissolved oxygen, and about 67% for water temperature. For water temperature, which is assessed as the 98th percentile in the UK, monthly sampling biases the mean downwards by about 1 °C compared to the true value, due to problems of assessing high percentiles with limited data. Low-frequency measurements will generally be unsuitable for assessing standards expressed as high percentiles. Confining sampling to the working week compared to all 7 days made little difference, but a modest improvement in precision could be obtained by sampling at the same time of day within a 3 h time window, and this is recommended. For parameters with a strong diel variation, such as dissolved oxygen, the value obtained, and thus possibly the WFD classification, can depend markedly on when in the cycle the sample was taken. Specifying this in the sampling regime would be a straightforward way to improve precision, but there needs to be agreement about how best to characterise risk in different types of river. These results suggest that in some cases it will be difficult to assign accurate WFD chemical classes or to detect likely trends using current sampling regimes, even for these largely groundwater-fed rivers. A more critical approach to sampling is needed to ensure that management actions are appropriate and supported by data.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Parkinson is a neurodegenerative disease, in which tremor is the main symptom. This paper investigates the use of different classification methods to identify tremors experienced by Parkinsonian patients.Some previous research has focussed tremor analysis on external body signals (e.g., electromyography, accelerometer signals, etc.). Our advantage is that we have access to sub-cortical data, which facilitates the applicability of the obtained results into real medical devices since we are dealing with brain signals directly. Local field potentials (LFP) were recorded in the subthalamic nucleus of 7 Parkinsonian patients through the implanted electrodes of a deep brain stimulation (DBS) device prior to its internalization. Measured LFP signals were preprocessed by means of splinting, down sampling, filtering, normalization and rec-tification. Then, feature extraction was conducted through a multi-level decomposition via a wavelettrans form. Finally, artificial intelligence techniques were applied to feature selection, clustering of tremor types, and tremor detection.The key contribution of this paper is to present initial results which indicate, to a high degree of certainty, that there appear to be two distinct subgroups of patients within the group-1 of patients according to the Consensus Statement of the Movement Disorder Society on Tremor. Such results may well lead to different resultant treatments for the patients involved, depending on how their tremor has been classified. Moreover, we propose a new approach for demand driven stimulation, in which tremor detection is also based on the subtype of tremor the patient has. Applying this knowledge to the tremor detection problem, it can be concluded that the results improve when patient clustering is applied prior to detection.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This work investigates the problem of feature selection in neuroimaging features from structural MRI brain images for the classification of subjects as healthy controls, suffering from Mild Cognitive Impairment or Alzheimer’s Disease. A Genetic Algorithm wrapper method for feature selection is adopted in conjunction with a Support Vector Machine classifier. In very large feature sets, feature selection is found to be redundant as the accuracy is often worsened when compared to an Support Vector Machine with no feature selection. However, when just the hippocampal subfields are used, feature selection shows a significant improvement of the classification accuracy. Three-class Support Vector Machines and two-class Support Vector Machines combined with weighted voting are also compared with the former and found more useful. The highest accuracy achieved at classifying the test data was 65.5% using a genetic algorithm for feature selection with a three-class Support Vector Machine classifier.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Introduction Human immunodeficiency virus (HIV) is a serious disease which can be associated with various activity limitations and participation restrictions. The aim of this paper was to describe how HIV affects the functioning and health of people within different environmental contexts, particularly with regard to access to medication. Method Four cross-sectional studies, three in South Africa and one in Brazil, had applied the International Classification of Functioning, Disability and Health (ICF) as a classification instrument to participants living with HIV. Each group was at a different stage of the disease. Only two groups had had continuing access to antiretroviral therapy. The existence of these descriptive sets enabled comparison of the disability experienced by people living with HIV at different stages of the disease and with differing access to antiretroviral therapy. Results Common problems experienced in all groups related to weight maintenance, with two-thirds of the sample reporting problems in this area. Mental functions presented the most problems in all groups, with sleep (50%, 92/185), energy and drive (45%, 83/185), and emotional functions (49%, 90/185) being the most affected. In those on long-term therapy, body image affected 93% (39/42) and was a major problem. The other groups reported pain as a problem, and those with limited access to treatment also reported mobility problems. Cardiopulmonary functions were affected in all groups. Conclusion Functional problems occurred in the areas of impairment and activity limitation in people at advanced stages of HIV, and more limitations occurred in the area of participation for those on antiretroviral treatment. The ICF provided a useful framework within which to describe the functioning of those with HIV and the impact of the environment. Given the wide spectrum of problems found, consideration could be given to a number of ICF core sets that are relevant to the different stages of HIV disease. (C) 2010 Chartered Society of Physiotherapy. Published by Elsevier Ltd. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Astronomy has evolved almost exclusively by the use of spectroscopic and imaging techniques, operated separately. With the development of modern technologies, it is possible to obtain data cubes in which one combines both techniques simultaneously, producing images with spectral resolution. To extract information from them can be quite complex, and hence the development of new methods of data analysis is desirable. We present a method of analysis of data cube (data from single field observations, containing two spatial and one spectral dimension) that uses Principal Component Analysis (PCA) to express the data in the form of reduced dimensionality, facilitating efficient information extraction from very large data sets. PCA transforms the system of correlated coordinates into a system of uncorrelated coordinates ordered by principal components of decreasing variance. The new coordinates are referred to as eigenvectors, and the projections of the data on to these coordinates produce images we will call tomograms. The association of the tomograms (images) to eigenvectors (spectra) is important for the interpretation of both. The eigenvectors are mutually orthogonal, and this information is fundamental for their handling and interpretation. When the data cube shows objects that present uncorrelated physical phenomena, the eigenvector`s orthogonality may be instrumental in separating and identifying them. By handling eigenvectors and tomograms, one can enhance features, extract noise, compress data, extract spectra, etc. We applied the method, for illustration purpose only, to the central region of the low ionization nuclear emission region (LINER) galaxy NGC 4736, and demonstrate that it has a type 1 active nucleus, not known before. Furthermore, we show that it is displaced from the centre of its stellar bulge.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Epidendrum L. is the largest genus of Orchidaceae in the Neotropical region; it has an impressive morphological diversification, which imposes difficulties in delimitation of both infrageneric and interspecific boundaries. In this study, we review infrageneric boundaries within the subgenus Amphiglottium and try to contribute to the understanding of morphological diversification and taxa delimitation within this group. We tested the monophyly of the subgenus Amphiglottium sect. Amphiglottium, expanding previous phylogenetic investigations and reevaluated previous infrageneric classifications proposed. Sequence data from the trnL-trnF region were analyzed with both parsimony and maximum likelihood criteria. AFLP markers were also obtained and analyzed with phylogenetic and principal coordinate analyses. Additionally, we obtained chromosome numbers for representative species within the group. The results strengthen the monophyly of the subgenus Amphiglottium but do not support the current classification system proposed by previous authors. Only section Tuberculata comprises a well-supported monophyletic group, with sections Carinata and Integra not supported. Instead of morphology, biogeographical and ecological patterns are reflected in the phylogenetic signal in this group. This study also confirms the large variability of chromosome numbers for the subgenus Amphiglottium (numbers ranging from 2n = 24 to 2n = 240), suggesting that polyploidy and hybridization are probably important mechanisms of speciation within the group.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Predictive performance evaluation is a fundamental issue in design, development, and deployment of classification systems. As predictive performance evaluation is a multidimensional problem, single scalar summaries such as error rate, although quite convenient due to its simplicity, can seldom evaluate all the aspects that a complete and reliable evaluation must consider. Due to this, various graphical performance evaluation methods are increasingly drawing the attention of machine learning, data mining, and pattern recognition communities. The main advantage of these types of methods resides in their ability to depict the trade-offs between evaluation aspects in a multidimensional space rather than reducing these aspects to an arbitrarily chosen (and often biased) single scalar measure. Furthermore, to appropriately select a suitable graphical method for a given task, it is crucial to identify its strengths and weaknesses. This paper surveys various graphical methods often used for predictive performance evaluation. By presenting these methods in the same framework, we hope this paper may shed some light on deciding which methods are more suitable to use in different situations.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This work proposes and discusses an approach for inducing Bayesian classifiers aimed at balancing the tradeoff between the precise probability estimates produced by time consuming unrestricted Bayesian networks and the computational efficiency of Naive Bayes (NB) classifiers. The proposed approach is based on the fundamental principles of the Heuristic Search Bayesian network learning. The Markov Blanket concept, as well as a proposed ""approximate Markov Blanket"" are used to reduce the number of nodes that form the Bayesian network to be induced from data. Consequently, the usually high computational cost of the heuristic search learning algorithms can be lessened, while Bayesian network structures better than NB can be achieved. The resulting algorithms, called DMBC (Dynamic Markov Blanket Classifier) and A-DMBC (Approximate DMBC), are empirically assessed in twelve domains that illustrate scenarios of particular interest. The obtained results are compared with NB and Tree Augmented Network (TAN) classifiers, and confinn that both proposed algorithms can provide good classification accuracies and better probability estimates than NB and TAN, while being more computationally efficient than the widely used K2 Algorithm.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper is concerned with the computational efficiency of fuzzy clustering algorithms when the data set to be clustered is described by a proximity matrix only (relational data) and the number of clusters must be automatically estimated from such data. A fuzzy variant of an evolutionary algorithm for relational clustering is derived and compared against two systematic (pseudo-exhaustive) approaches that can also be used to automatically estimate the number of fuzzy clusters in relational data. An extensive collection of experiments involving 18 artificial and two real data sets is reported and analyzed. (C) 2011 Elsevier B.V. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

There is a family of well-known external clustering validity indexes to measure the degree of compatibility or similarity between two hard partitions of a given data set, including partitions with different numbers of categories. A unified, fully equivalent set-theoretic formulation for an important class of such indexes was derived and extended to the fuzzy domain in a previous work by the author [Campello, R.J.G.B., 2007. A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment. Pattern Recognition Lett., 28, 833-841]. However, the proposed fuzzy set-theoretic formulation is not valid as a general approach for comparing two fuzzy partitions of data. Instead, it is an approach for comparing a fuzzy partition against a hard referential partition of the data into mutually disjoint categories. In this paper, generalized external indexes for comparing two data partitions with overlapping categories are introduced. These indexes can be used as general measures for comparing two partitions of the same data set into overlapping categories. An important issue that is seldom touched in the literature is also addressed in the paper, namely, how to compare two partitions of different subsamples of data. A number of pedagogical examples and three simulation experiments are presented and analyzed in details. A review of recent related work compiled from the literature is also provided. (c) 2010 Elsevier B.V. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The substitution of missing values, also called imputation, is an important data preparation task for many domains. Ideally, the substitution of missing values should not insert biases into the dataset. This aspect has been usually assessed by some measures of the prediction capability of imputation methods. Such measures assume the simulation of missing entries for some attributes whose values are actually known. These artificially missing values are imputed and then compared with the original values. Although this evaluation is useful, it does not allow the influence of imputed values in the ultimate modelling task (e.g. in classification) to be inferred. We argue that imputation cannot be properly evaluated apart from the modelling task. Thus, alternative approaches are needed. This article elaborates on the influence of imputed values in classification. In particular, a practical procedure for estimating the inserted bias is described. As an additional contribution, we have used such a procedure to empirically illustrate the performance of three imputation methods (majority, naive Bayes and Bayesian networks) in three datasets. Three classifiers (decision tree, naive Bayes and nearest neighbours) have been used as modelling tools in our experiments. The achieved results illustrate a variety of situations that can take place in the data preparation practice.