15 resultados para optimal feature selection

em Biblioteca Digital da Produção Intelectual da Universidade de São Paulo


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Although nontechnical losses automatic identification has been massively studied, the problem of selecting the most representative features in order to boost the identification accuracy and to characterize possible illegal consumers has not attracted much attention in this context. In this paper, we focus on this problem by reviewing three evolutionary-based techniques for feature selection, and we also introduce one of them in this context. The results demonstrated that selecting the most representative features can improve a lot of the classification accuracy of possible frauds in datasets composed by industrial and commercial profiles.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Abstract Background One goal of gene expression profiling is to identify signature genes that robustly distinguish different types or grades of tumors. Several tumor classifiers based on expression profiling have been proposed using microarray technique. Due to important differences in the probabilistic models of microarray and SAGE technologies, it is important to develop suitable techniques to select specific genes from SAGE measurements. Results A new framework to select specific genes that distinguish different biological states based on the analysis of SAGE data is proposed. The new framework applies the bolstered error for the identification of strong genes that separate the biological states in a feature space defined by the gene expression of a training set. Credibility intervals defined from a probabilistic model of SAGE measurements are used to identify the genes that distinguish the different states with more reliability among all gene groups selected by the strong genes method. A score taking into account the credibility and the bolstered error values in order to rank the groups of considered genes is proposed. Results obtained using SAGE data from gliomas are presented, thus corroborating the introduced methodology. Conclusion The model representing counting data, such as SAGE, provides additional statistical information that allows a more robust analysis. The additional statistical information provided by the probabilistic model is incorporated in the methodology described in the paper. The introduced method is suitable to identify signature genes that lead to a good separation of the biological states using SAGE and may be adapted for other counting methods such as Massive Parallel Signature Sequencing (MPSS) or the recent Sequencing-By-Synthesis (SBS) technique. Some of such genes identified by the proposed method may be useful to generate classifiers.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This work proposes a system for classification of industrial steel pieces by means of magnetic nondestructive device. The proposed classification system presents two main stages, online system stage and off-line system stage. In online stage, the system classifies inputs and saves misclassification information in order to perform posterior analyses. In the off-line optimization stage, the topology of a Probabilistic Neural Network is optimized by a Feature Selection algorithm combined with the Probabilistic Neural Network to increase the classification rate. The proposed Feature Selection algorithm searches for the signal spectrogram by combining three basic elements: a Sequential Forward Selection algorithm, a Feature Cluster Grow algorithm with classification rate gradient analysis and a Sequential Backward Selection. Also, a trash-data recycling algorithm is proposed to obtain the optimal feedback samples selected from the misclassified ones.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Background: A current challenge in gene annotation is to define the gene function in the context of the network of relationships instead of using single genes. The inference of gene networks (GNs) has emerged as an approach to better understand the biology of the system and to study how several components of this network interact with each other and keep their functions stable. However, in general there is no sufficient data to accurately recover the GNs from their expression levels leading to the curse of dimensionality, in which the number of variables is higher than samples. One way to mitigate this problem is to integrate biological data instead of using only the expression profiles in the inference process. Nowadays, the use of several biological information in inference methods had a significant increase in order to better recover the connections between genes and reduce the false positives. What makes this strategy so interesting is the possibility of confirming the known connections through the included biological data, and the possibility of discovering new relationships between genes when observed the expression data. Although several works in data integration have increased the performance of the network inference methods, the real contribution of adding each type of biological information in the obtained improvement is not clear. Methods: We propose a methodology to include biological information into an inference algorithm in order to assess its prediction gain by using biological information and expression profile together. We also evaluated and compared the gain of adding four types of biological information: (a) protein-protein interaction, (b) Rosetta stone fusion proteins, (c) KEGG and (d) KEGG+GO. Results and conclusions: This work presents a first comparison of the gain in the use of prior biological information in the inference of GNs by considering the eukaryote (P. falciparum) organism. Our results indicates that information based on direct interaction can produce a higher improvement in the gain than data about a less specific relationship as GO or KEGG. Also, as expected, the results show that the use of biological information is a very important approach for the improvement of the inference. We also compared the gain in the inference of the global network and only the hubs. The results indicates that the use of biological information can improve the identification of the most connected proteins.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Hematopoietic cell transplantation (HCT) is an emerging therapy for patients with severe autoimmune diseases (AID). We report data on 368 patients with AID who underwent HCT in 64 North and South American transplantation centers reported to the Center for International Blood and Marrow Transplant Research between 1996 and 2009. Most of the HCTs involved autologous grafts (n = 339); allogeneic HCT (n = 29) was done mostly in children. The most common indications for HCT were multiple sclerosis, systemic sclerosis, and systemic lupus erythematosus. The median age at transplantation was 38 years for autologous HCT and 25 years for allogeneic HCT. The corresponding times from diagnosis to HCT were 35 months and 24 months. Three-year overall survival after autologous HCT was 86% (95% confidence interval [CI], 81%-91%). Median follow-up of survivors was 31 months (range, 1-144 months). The most common causes of death were AID progression, infections, and organ failure. On multivariate analysis, the risk of death was higher in patients at centers that performed fewer than 5 autologous HCTs (relative risk, 3.5; 95% CI, 1.1-11.1; P = .03) and those that performed 5 to 15 autologous HCTs for AID during the study period (relative risk, 4.2; 95% CI, 1.5-11.7; P = .006) compared with patients at centers that performed more than 15 autologous HCTs for AID during the study period. AID is an emerging indication for HCT in the region. Collaboration of hematologists and other disease specialists with an outcomes database is important to promote optimal patient selection, analysis of the impact of prognostic variables and long-term outcomes, and development of clinical trials. Biol Blood Marrow Transplant 18: 1471-1478 (2012) (C) 2012 Published by Elsevier Inc. on behalf of American Society for Blood and Marrow Transplantation

Relevância:

80.00% 80.00%

Publicador:

Resumo:

A set of predictor variables is said to be intrinsically multivariate predictive (IMP) for a target variable if all properly contained subsets of the predictor set are poor predictors of the. target but the full set predicts the target with great accuracy. In a previous article, the main properties of IMP Boolean variables have been analytically described, including the introduction of the IMP score, a metric based on the coefficient of determination (CoD) as a measure of predictiveness with respect to the target variable. It was shown that the IMP score depends on four main properties: logic of connection, predictive power, covariance between predictors and marginal predictor probabilities (biases). This paper extends that work to a broader context, in an attempt to characterize properties of discrete Bayesian networks that contribute to the presence of variables (network nodes) with high IMP scores. We have found that there is a relationship between the IMP score of a node and its territory size, i.e., its position along a pathway with one source: nodes far from the source display larger IMP scores than those closer to the source, and longer pathways display larger maximum IMP scores. This appears to be a consequence of the fact that nodes with small territory have larger probability of having highly covariate predictors, which leads to smaller IMP scores. In addition, a larger number of XOR and NXOR predictive logic relationships has positive influence over the maximum IMP score found in the pathway. This work presents analytical results based on a simple structure network and an analysis involving random networks constructed by computational simulations. Finally, results from a real Bayesian network application are provided. (C) 2012 Elsevier Inc. All rights reserved.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Recent experimental evidence has suggested a neuromodulatory deficit in Alzheimer's disease (AD). In this paper, we present a new electroencephalogram (EEG) based metric to quantitatively characterize neuromodulatory activity. More specifically, the short-term EEG amplitude modulation rate-of-change (i.e., modulation frequency) is computed for five EEG subband signals. To test the performance of the proposed metric, a classification task was performed on a database of 32 participants partitioned into three groups of approximately equal size: healthy controls, patients diagnosed with mild AD, and those with moderate-to-severe AD. To gauge the benefits of the proposed metric, performance results were compared with those obtained using EEG spectral peak parameters which were recently shown to outperform other conventional EEG measures. Using a simple feature selection algorithm based on area-under-the-curve maximization and a support vector machine classifier, the proposed parameters resulted in accuracy gains, relative to spectral peak parameters, of 21.3% when discriminating between the three groups and by 50% when mild and moderate-to-severe groups were merged into one. The preliminary findings reported herein provide promising insights that automated tools may be developed to assist physicians in very early diagnosis of AD as well as provide researchers with a tool to automatically characterize cross-frequency interactions and their changes with disease.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Several recent studies in literature have identified brain morphological alterations associated to Borderline Personality Disorder (BPD) patients. These findings are reported by studies based on voxel-based-morphometry analysis of structural MRI data, comparing mean gray-matter concentration between groups of BPD patients and healthy controls. On the other hand, mean differences between groups are not informative about the discriminative value of neuroimaging data to predict the group of individual subjects. In this paper, we go beyond mean differences analyses, and explore to what extent individual BPD patients can be differentiated from controls (25 subjects in each group), using a combination of automated-morphometric tools for regional cortical thickness/volumetric estimation and Support Vector Machine classifier. The approach included a feature selection step in order to identify the regions containing most discriminative information. The accuracy of this classifier was evaluated using the leave-one-subject-out procedure. The brain regions indicated as containing relevant information to discriminate groups were the orbitofrontal, rostral anterior cingulate, posterior cingulate, middle temporal cortices, among others. These areas, which are distinctively involved in emotional and affect regulation of BPD patients, were the most informative regions to achieve both sensitivity and specificity values of 80% in SVM classification. The findings suggest that this new methodology can add clinical and potential diagnostic value to neuroimaging of psychiatric disorders. (C) 2012 Elsevier Ltd. All rights reserved.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

In this paper, we perform a thorough analysis of a spectral phase-encoded time spreading optical code division multiple access (SPECTS-OCDMA) system based on Walsh-Hadamard (W-H) codes aiming not only at finding optimal code-set selections but also at assessing its loss of security due to crosstalk. We prove that an inadequate choice of codes can make the crosstalk between active users to become large enough so as to cause the data from the user of interest to be detected by other user. The proposed algorithm for code optimization targets code sets that produce minimum bit error rate (BER) among all codes for a specific number of simultaneous users. This methodology allows us to find optimal code sets for any OCDMA system, regardless the code family used and the number of active users. This procedure is crucial for circumventing the unexpected lack of security due to crosstalk. We also show that a SPECTS-OCDMA system based on W-H 32(64) fundamentally limits the number of simultaneous users to 4(8) with no security violation due to crosstalk. More importantly, we prove that only a small fraction of the available code sets is actually immune to crosstalk with acceptable BER (<10(-9)) i.e., approximately 0.5% for W-H 32 with four simultaneous users, and about 1 x 10(-4)% for W-H 64 with eight simultaneous users.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The starting point of this article is the question "How to retrieve fingerprints of rhythm in written texts?" We address this problem in the case of Brazilian and European Portuguese. These two dialects of Modern Portuguese share the same lexicon and most of the sentences they produce are superficially identical. Yet they are conjectured, on linguistic grounds, to implement different rhythms. We show that this linguistic question can be formulated as a problem of model selection in the class of variable length Markov chains. To carry on this approach, we compare texts from European and Brazilian Portuguese. These texts are previously encoded according to some basic rhythmic features of the sentences which can be automatically retrieved. This is an entirely new approach from the linguistic point of view. Our statistical contribution is the introduction of the smallest maximizer criterion which is a constant free procedure for model selection. As a by-product, this provides a solution for the problem of optimal choice of the penalty constant when using the BIC to select a variable length Markov chain. Besides proving the consistency of the smallest maximizer criterion when the sample size diverges, we also make a simulation study comparing our approach with both the standard BIC selection and the Peres-Shields order estimation. Applied to the linguistic sample constituted for our case study, the smallest maximizer criterion assigns different context-tree models to the two dialects of Portuguese. The features of the selected models are compatible with current conjectures discussed in the linguistic literature.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Araucaria angustifolia, a unique species of this genus that occurs naturally in Brazil, has a high socio-economic and environmental value and is critically endangered of extinction, since it has been submitted to intense predatory exploitation during the last century. Root-associated bacteria from A. angustifolia were isolated, selected and characterized for their biotechnological potential of growth promotion and biocontrol of plant pathogenic fungi. Ninety-seven strains were isolated and subjected to chemical tests. All isolates presented at least one positive feature, characterizing them as potential PGPR. Eighteen isolates produced indole-3-acetic acid (IAA), 27 were able to solubilize inorganic phosphate, 21 isolates were presumable diazotrophs, with pellicle formation in nitrogen-free culture medium, 83 were phosphatases producers, 37 were positive for siderophores and 45 endospore-forming isolates were antagonistic to Fusarium oxysporum, a pathogen of conifers. We also observed the presence of bacterial strains with multiple beneficial mechanisms of action. Analyzing the fatty acid methyl ester (FAME) and partial sequencing of the 16S rRNA gene of these isolates, it was possible to characterize the most effective isolates as belonging to Bacillaceae (9 isolates), Enterobacteriaceae (11) and Pseudomonadaceae (1). As far as we know, this is the first study to include the species Ewingella americana as a PGPR. (C) 2011 Elsevier GmbH. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Purpose: To estimate the metabolic activity of rectal cancers at 6 and 12 weeks after completion of chemoradiation therapy (CRT) by 2-[fluorine-18] fluoro-2-deoxy-D-glucose-labeled positron emission tomography/computed tomography ([18 FDG] PET/CT) imaging and correlate with response to CRT. Methods and Materials: Patients with cT2-4N0-2M0 distal rectal adenocarcinoma treated with long-course neoadjuvant CRT (54 Gy, 5-fluouracil-based) were prospectively studied (ClinicalTrials. org identifier NCT00254683). All patients underwent 3 PET/CT studies (at baseline and 6 and 12 weeks fromCRT completion). Clinical assessment was at 12 weeks. Maximal standard uptakevalue (SUVmax) of the primary tumor wasmeasured and recorded at eachPET/CTstudy after 1 h (early) and3 h (late) from 18 FDGinjection. Patientswith an increase in early SUVmax between 6 and 12 weeks were considered " bad" responders and the others as "good" responders. Results: Ninety-one patients were included; 46 patients (51%) were "bad" responders, whereas 45 (49%) patients were " good" responders. " Bad" responders were less likely to develop complete clinical response (6.5% vs. 37.8%, respectively; PZ. 001), less likely to develop significant histological tumor regression (complete or near-complete pathological response; 16% vs. 45%, respectively; PZ. 008) and exhibited greater final tumor dimension (4.3cmvs. 3.3cm; PZ. 03). Decrease between early (1 h) and late (3 h) SUVmax at 6-week PET/CTwas a significant predictor of " good" response (accuracy of 67%). Conclusions: Patients who developed an increase in SUVmax after 6 weeks were less likely to develop significant tumor downstaging. Early-late SUVmax variation at 6-week PET/CT may help identify these patients and allow tailored selection of CRT-surgery intervals for individual patients. (C) 2012 Elsevier Inc.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Various factors are believed to govern the selection of references in citation networks, but a precise, quantitative determination of their importance has remained elusive. In this paper, we show that three factors can account for the referencing pattern of citation networks for two topics, namely "graphenes" and "complex networks", thus allowing one to reproduce the topological features of the networks built with papers being the nodes and the edges established by citations. The most relevant factor was content similarity, while the other two - in-degree (i.e. citation counts) and age of publication - had varying importance depending on the topic studied. This dependence indicates that additional factors could play a role. Indeed, by intuition one should expect the reputation (or visibility) of authors and/or institutions to affect the referencing pattern, and this is only indirectly considered via the in-degree that should correlate with such reputation. Because information on reputation is not readily available, we simulated its effect on artificial citation networks considering two communities with distinct fitness (visibility) parameters. One community was assumed to have twice the fitness value of the other, which amounts to a double probability for a paper being cited. While the h-index for authors in the community with larger fitness evolved with time with slightly higher values than for the control network (no fitness considered), a drastic effect was noted for the community with smaller fitness. (C) 2012 Elsevier Ltd. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In this paper, we consider the stochastic optimal control problem of discrete-time linear systems subject to Markov jumps and multiplicative noises under two criteria. The first one is an unconstrained mean-variance trade-off performance criterion along the time, and the second one is a minimum variance criterion along the time with constraints on the expected output. We present explicit conditions for the existence of an optimal control strategy for the problems, generalizing previous results in the literature. We conclude the paper by presenting a numerical example of a multi-period portfolio selection problem with regime switching in which it is desired to minimize the sum of the variances of the portfolio along the time under the restriction of keeping the expected value of the portfolio greater than some minimum values specified by the investor. (C) 2011 Elsevier Ltd. All rights reserved.