914 results for Data selection


Relevance:

40.00%

Publisher:

Abstract:

This paper investigates the gene selection problem for microarray data with small samples and variant correlation. Most existing algorithms require expensive computation, especially when thousands of genes are involved. The main objective of this paper is to select the most informative genes from microarray data effectively while keeping the computational cost affordable. This is achieved by proposing a novel forward gene selection algorithm (FGSA). To overcome the small-sample problem, the augmented data technique is first employed to produce an augmented data set. Taking inspiration from other gene selection methods, the L2-norm penalty is then introduced into the recently proposed fast regression algorithm to achieve the group selection ability. Finally, by defining a proper regression context, the proposed method can be implemented efficiently in software, which significantly reduces the computational burden. Both computational complexity analysis and simulation results confirm the effectiveness of the proposed algorithm in comparison with other approaches.
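The abstract above does not give the details of FGSA, so the following is only a minimal sketch of the general idea it builds on: greedy forward selection in which each candidate gene is scored by an L2-penalised (ridge) least-squares fit. The function names, the penalty value `lam`, and the toy data are assumptions made for illustration; this is not the authors' algorithm and it omits their augmented-data step and fast regression updates.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_objective(X, y, cols, lam):
    """Penalised residual sum of squares of a ridge fit on the given columns."""
    # Normal equations: (Xs^T Xs + lam * I) w = Xs^T y
    A = [[sum(row[a] * row[b] for row in X) + (lam if a == b else 0.0)
          for b in cols] for a in cols]
    rhs = [sum(row[a] * yi for row, yi in zip(X, y)) for a in cols]
    w = solve(A, rhs)
    rss = sum((yi - sum(wj * row[c] for wj, c in zip(w, cols))) ** 2
              for row, yi in zip(X, y))
    return rss + lam * sum(wj * wj for wj in w)


def forward_select(X, y, n_select, lam=0.1):
    """Greedily add the column that most reduces the ridge objective."""
    selected = []
    remaining = list(range(len(X[0])))
    for _ in range(n_select):
        best = min(remaining,
                   key=lambda j: ridge_objective(X, y, selected + [j], lam))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): 4 samples, 4 "genes"; y depends on genes 0 and 2 only.
X = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]
y = [3 * row[0] - 2 * row[2] for row in X]

print(forward_select(X, y, n_select=2))
```

Refitting from scratch at every step is the expensive part of this naive version; the abstract's claim is precisely that a suitable regression formulation avoids that cost.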

Relevance:

40.00%

Publisher:

Abstract:

This paper considers the problem of identifying a high-dimensional nonlinear non-parametric system when only a limited data set is available. Algorithms are proposed for this purpose that exploit the relationship between the input variables and the output, as well as the interdependence of the input variables, so that the importance of each input variable can be established. A key to these algorithms is the non-parametric two-stage input selection algorithm.

Relevance:

40.00%

Publisher:

Abstract:

For years, choosing the right career by monitoring the trends and scope of different career paths has been a requirement for young people all over the world. In this paper we provide a scientific, data-mining-based method for predicting the job absorption rate and the waiting time needed for 100% placement for different engineering courses in India. This will greatly help students in India decide on the right discipline for a bright future. Information about graduated students is obtained from the NTMIS (National Technical Manpower Information System) nodal centre in Kochi, India, located at Cochin University of Science and Technology.

Relevance:

40.00%

Publisher:

Abstract:

In the continuing debate over the impact of genetically modified (GM) crops on farmers of developing countries, it is important to accurately measure magnitudes such as farm-level yield gains from GM crop adoption. Yet most farm-level studies in the literature do not control for farmer self-selection, a potentially important source of bias in such estimates. We use farm-level panel data from Indian cotton farmers to investigate the yield effect of GM insect-resistant cotton. We explicitly take into account the fact that the choice of crop variety is an endogenous variable which might lead to bias from self-selection. A production function is estimated using a fixed-effects model to control for selection bias. Our results show that efficient farmers adopt Bacillus thuringiensis (Bt) cotton at a higher rate than their less efficient peers. This suggests that cross-sectional estimates of the yield effect of Bt cotton, which do not control for self-selection effects, are likely to be biased upwards. However, after controlling for selection bias, we still find that there is a significant positive yield effect from adoption of Bt cotton that more than offsets the additional cost of Bt seed.
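The self-selection argument in the abstract above can be illustrated with a small sketch. It assumes a single regressor (a 0/1 Bt adoption indicator) and made-up numbers for two hypothetical farms; the study itself estimates a full production function on real panel data. Subtracting each farm's own mean (the within transformation) removes time-invariant farmer efficiency, which is what a fixed-effects model does.

```python
from collections import defaultdict


def slope(x, y):
    """Ordinary least-squares slope of y on x (with intercept)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den


def within_transform(groups, values):
    """Subtract each group's mean from its observations."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for g, v in zip(groups, values):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for g, v in zip(groups, values)]


# Hypothetical panel: farm A is more efficient (baseline 10) and adopts Bt
# earlier; farm B has baseline 5. The true yield effect of Bt is set to 2.
farm = ["A", "A", "A", "B", "B", "B"]
bt = [0, 1, 1, 0, 0, 1]
yield_ = [10, 12, 12, 5, 5, 7]

pooled = slope(bt, yield_)  # ignores farm identity, mixes in efficiency
fixed_effects = slope(within_transform(farm, bt),
                      within_transform(farm, yield_))

print(f"pooled estimate:        {pooled:.2f}")
print(f"fixed-effects estimate: {fixed_effects:.2f}")
```

In this toy panel the pooled estimate is larger than the fixed-effects estimate because the more efficient farm adopts more often, which is the direction of bias the abstract describes.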

Relevance:

40.00%

Publisher:

Abstract:

Background: Affymetrix GeneChip arrays are widely used for transcriptomic studies in a diverse range of species. Each gene is represented on a GeneChip array by a probe-set, consisting of up to 16 probe-pairs. Signal intensities across probe-pairs within a probe-set vary in part due to different physical hybridisation characteristics of individual probes with their target labelled transcripts. We have previously developed a technique to study the transcriptomes of heterologous species based on hybridising genomic DNA (gDNA) to a GeneChip array designed for a different species, and subsequently using only those probes with good homology. Results: Here we have investigated the effects of hybridising homologous species gDNA to study the transcriptomes of species for which the arrays have been designed. Genomic DNA from Arabidopsis thaliana and rice (Oryza sativa) were hybridised to the Affymetrix Arabidopsis ATH1 and Rice Genome GeneChip arrays respectively. Probe selection based on gDNA hybridisation intensity increased the number of genes identified as significantly differentially expressed in two published studies of Arabidopsis development, and optimised the analysis of technical replicates obtained from pooled samples of RNA from rice. Conclusion: This mixed physical and bioinformatics approach can be used to optimise estimates of gene expression when using GeneChip arrays.
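As a rough illustration of probe selection by gDNA hybridisation intensity, the sketch below keeps only probes whose gDNA signal exceeds a threshold and then summarises RNA signal over the retained probes. The threshold value, the data layout, and the use of a plain mean as the probe-set summary are all assumptions for illustration; the abstract does not specify them.

```python
def select_probes(gdna_intensity, threshold):
    """Return, per probe-set, the indices of probes whose gDNA hybridisation
    intensity exceeds the threshold. Probe-sets with no retained probe are
    dropped."""
    mask = {}
    for probe_set, values in gdna_intensity.items():
        keep = [i for i, v in enumerate(values) if v > threshold]
        if keep:
            mask[probe_set] = keep
    return mask


def probe_set_signal(rna_intensity, mask):
    """Average the RNA signal over the retained probes of each probe-set."""
    return {ps: sum(rna_intensity[ps][i] for i in idx) / len(idx)
            for ps, idx in mask.items()}


# Hypothetical intensities for two probe-sets with three probes each.
gdna = {"geneA": [120, 40, 300], "geneB": [10, 20, 30]}
rna = {"geneA": [8.0, 2.0, 10.0], "geneB": [1.0, 1.5, 0.5]}

mask = select_probes(gdna, threshold=100)
print(mask)
print(probe_set_signal(rna, mask))
```

Raising the threshold trades coverage (fewer probe-sets survive) for probes that hybridise more reliably, so in practice it would be tuned rather than fixed.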

Relevance:

40.00%

Publisher:

Abstract:

Recent studies showed that features extracted from brain MRIs can discriminate well between Alzheimer's disease and Mild Cognitive Impairment. This study provides an algorithm that sequentially applies advanced feature selection methods to find the best subset of features in terms of binary classification accuracy. The classifiers that provided the highest accuracies were then used to solve a multi-class problem by the one-versus-one strategy. Although several approaches based on Regions of Interest (ROIs) extraction exist, the predictive power of features has not yet been investigated by comparing filter and wrapper techniques. The findings of this work suggest that (i) IntraCranial Volume (ICV) normalization can lead to overfitting and worsen the prediction accuracy on the test set and (ii) the combined use of a Random Forest-based filter with a Support Vector Machines-based wrapper improves the accuracy of binary classification.
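The one-versus-one strategy mentioned above combines pairwise binary classifiers by majority vote. The sketch below shows only that voting step. The class labels (including "CN" for a cognitively normal group), the single scalar feature, and the threshold classifiers are hypothetical stand-ins; the study uses trained classifiers on MRI-derived features.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, binary_classifiers):
    """Predict a class by majority vote over all pairwise classifiers.

    binary_classifiers maps a pair (a, b) to a function that returns
    either a or b for the input x.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[binary_classifiers[(a, b)](x)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical pairwise classifiers on one scalar feature.
classes = ["CN", "MCI", "AD"]
classifiers = {
    ("CN", "MCI"): lambda x: "CN" if x < 1 else "MCI",
    ("CN", "AD"): lambda x: "CN" if x < 2 else "AD",
    ("MCI", "AD"): lambda x: "MCI" if x < 3 else "AD",
}

for value in (0.5, 2.5, 3.5):
    print(value, one_vs_one_predict(value, classes, classifiers))
```

With k classes this needs k(k-1)/2 binary classifiers, which is why the best-performing binary models can be reused directly for the multi-class problem.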

Relevance:

40.00%

Publisher:

Abstract:

Seamless phase II/III clinical trials are conducted in two stages with treatment selection at the first stage. In the first stage, patients are randomized to a control or one of k > 1 experimental treatments. At the end of this stage, interim data are analysed, and a decision is made concerning which experimental treatment should continue to the second stage. If the primary endpoint is observable only after some period of follow-up, at the interim analysis data may be available on some early outcome on a larger number of patients than those for whom the primary endpoint is available. These early endpoint data can thus be used for treatment selection. For two previously proposed approaches, the power has been shown to be greater for one or other method depending on the true treatment effects and correlations. We propose a new approach that builds on the previously proposed approaches and uses data available at the interim analysis to estimate these parameters and then, on the basis of these estimates, chooses the treatment selection method with the highest probability of correctly selecting the most effective treatment. This method is shown to perform well compared with the two previously described methods for a wide range of true parameter values. In most cases, the new method performs similarly to, and in some cases better than, either of the two previously proposed methods.

Relevance:

40.00%

Publisher:

Abstract:

Phylogenetic analyses of chloroplast DNA sequences, morphology, and combined data have provided consistent support for many of the major branches within the angiosperm clade Dipsacales. Here we use sequences from three mitochondrial loci to test the existing broad-scale phylogeny and to attempt to resolve several relationships that have remained uncertain. Parsimony, maximum likelihood, and Bayesian analyses of a combined mitochondrial data set recover trees broadly consistent with previous studies, although resolution and support are lower than in the largest chloroplast analyses. Combining chloroplast and mitochondrial data results in a generally well-resolved and very strongly supported topology, but the previously recognized problem areas remain. To investigate why these relationships have been difficult to resolve, we conducted a series of experiments using different data partitions and heterogeneous substitution models. Usually more complex modeling schemes are favored regardless of the partitions recognized, but model choice had little effect on topology or support values. In contrast, there are consistent but weakly supported differences in the topologies recovered from coding and non-coding matrices. These conflicts directly correspond to relationships that were poorly resolved in analyses of the full combined chloroplast-mitochondrial data set. We suggest incongruent signal has contributed to our inability to confidently resolve these problem areas. (c) 2007 Elsevier Inc. All rights reserved.

Relevance:

40.00%

Publisher:

Abstract:

Microarray data classification is one of the most important emerging clinical applications in the medical community. Machine learning algorithms are most frequently used to complete this task. We selected one of the state-of-the-art kernel-based algorithms, the support vector machine (SVM), to classify microarray data. As a large number of kernels are available, a significant research question is which kernel is best for patient diagnosis based on microarray data classification using SVM. We first suggest three solutions based on data visualization and quantitative measures. The proposed solutions are then tested on different types of microarray problems. Finally, we found that the rule-based approach is most useful for automatic kernel selection for SVM to classify microarray data.

Relevance:

40.00%

Publisher:

Abstract:

The Generalized Estimating Equations (GEE) method is one of the most commonly used statistical methods for the analysis of longitudinal data in epidemiological studies. The method requires a working correlation structure to be specified for the repeated measures of each subject's outcome variable. However, statistical criteria for selecting the best correlation structure and the best subset of explanatory variables in GEE have become available only recently, because the GEE method was developed on the basis of quasi-likelihood theory. Maximum likelihood based model selection methods, such as the widely used Akaike Information Criterion (AIC), are not directly applicable to GEE. Pan (2001) proposed a selection method called QIC, which can be used to select the best correlation structure and the best subset of explanatory variables. Based on the QIC method, we developed a computing program to calculate the QIC value for a range of different distributions, link functions and correlation structures. This program was written in Stata software. In this article, we introduce this program and demonstrate how to use it to select the most parsimonious model in GEE analyses of longitudinal data through several representative examples.
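For orientation, the QIC of Pan (2001) is commonly written as below. The notation is an assumption chosen here and does not come from the abstract: $R$ is the working correlation structure, $\hat\beta(R)$ the GEE estimate under $R$, $Q(\cdot\,; I)$ the quasi-likelihood evaluated under the independence working correlation, $\hat\Omega_I$ the model-based information matrix under independence, and $\hat V_R$ the robust (sandwich) covariance estimate under $R$.

```latex
\mathrm{QIC}(R) \;=\; -2\, Q\bigl(\hat\beta(R);\, I\bigr) \;+\; 2\,\operatorname{trace}\bigl(\hat\Omega_I\, \hat V_R\bigr)
```

As with AIC, the first term rewards fit and the second penalises complexity, and the candidate model or correlation structure with the smallest QIC is preferred.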