991 results for Variables selection
Abstract:
The aim of this work is to establish a relationship between schistosomiasis prevalence and socio-environmental variables in the state of Minas Gerais, Brazil, through multiple linear regression. After a variable selection phase, the final regression model was established with a set of spatial variables comprising summer minimum temperature, human development index, and vegetation type. Based on this model, a schistosomiasis risk map was built for Minas Gerais.
Abstract:
In this work, the quantitative analysis of glucose, triglycerides and cholesterol (total and HDL) in both rat and human blood plasma was performed without any sample pretreatment, using near infrared spectroscopy (NIR) combined with multivariate methods. For this purpose, different techniques and algorithms used to pre-process data, select variables and build multivariate regression models were compared with one another, including partial least squares regression (PLS), non-linear regression by artificial neural networks (ANN), interval partial least squares regression (iPLS), genetic algorithm (GA) and successive projections algorithm (SPA), amongst others. For the determinations in rat blood plasma, the variable selection algorithms gave satisfactory results for both the correlation coefficients (R²) and the root mean square error of prediction (RMSEP) for the three analytes, especially for triglycerides and HDL cholesterol. The RMSEP values for glucose, triglycerides and HDL cholesterol obtained with the best PLS model were 6.08, 16.07 and 2.03 mg dL⁻¹, respectively. For the determinations in human blood plasma, in contrast, the predictions obtained with the PLS models were unsatisfactory, showing a non-linear tendency and the presence of bias. ANN regression was therefore applied as an alternative to PLS, given its ability to model data from non-linear systems. The root mean square error of monitoring (RMSEM) for glucose, triglycerides and total cholesterol, for the best ANN models, was 13.20, 10.31 and 12.35 mg dL⁻¹, respectively. Statistical tests (F and t) suggest that NIR spectroscopy combined with multivariate regression methods (PLS and ANN) is capable of quantifying the analytes (glucose, triglycerides and cholesterol) even when they are present in highly complex biological fluids such as blood plasma.
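As a minimal sketch of the two validation figures this abstract relies on, the snippet below computes RMSEP and bias from paired reference and predicted concentrations. The function names and the sample values are hypothetical and invented for illustration; they are not taken from the study.

```python
import math

def rmsep(reference, predicted):
    """Root mean square error of prediction over a validation set."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def bias(reference, predicted):
    """Mean signed error; a value far from zero indicates systematic bias."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p - r for r, p in zip(reference, predicted)) / len(reference)

# Hypothetical glucose concentrations in mg/dL (illustrative only).
reference = [90.0, 110.0, 130.0, 150.0]
predicted = [92.0, 112.0, 133.0, 151.0]

print(f"RMSEP = {rmsep(reference, predicted):.2f} mg/dL")
print(f"bias  = {bias(reference, predicted):.2f} mg/dL")
```

A model can show a low RMSEP and still be biased, which is why both numbers are usually reported together.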
Abstract:
The aim of this study was to evaluate the potential of near-infrared reflectance spectroscopy (NIRS) as a rapid and non-destructive method to determine the soluble solid content (SSC), pH and titratable acidity of intact plums. Plum samples with a total solids content ranging from 5.7 to 15%, pH from 2.72 to 3.84 and titratable acidity from 0.88 to 3.6% were collected from supermarkets in Natal, Brazil, and NIR spectra were acquired in the 714-2500 nm range. Several multivariate calibration techniques were compared with respect to several data pre-processing and variable selection algorithms, such as interval partial least squares (iPLS), genetic algorithm (GA), successive projections algorithm (SPA) and ordered predictors selection (OPS). Validation models for SSC, pH and titratable acidity had a coefficient of correlation (R) of 0.95, 0.90 and 0.80, as well as a root mean square error of prediction (RMSEP) of 0.45 °Brix, 0.07 and 0.40%, respectively. From these results, it can be concluded that NIR spectroscopy can be used as a non-destructive alternative for measuring the SSC, pH and titratable acidity of plums.
Abstract:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Abstract:
We present a Bayesian network model for continuous variables, in which densities and conditional densities are estimated with B-spline MoPs. We use a novel approach that obtains conditional density estimates directly from B-spline properties. In particular, we implement naive Bayes and wrapper variable selection. Finally, we apply our techniques to the problem of predicting the morphological variables of neurons from electrophysiological ones.
Abstract:
In cluster analysis, it can be useful to interpret the partition built from the data in the light of external categorical variables which are not directly involved in clustering the data. An approach is proposed in the model-based clustering context to select a number of clusters which both fits the data well and takes advantage of the potential illustrative ability of the external variables. This approach makes use of the integrated joint likelihood of the data and the partitions at hand, namely the model-based partition and the partitions associated with the external variables. It is noteworthy that each mixture model is fitted to the data by maximum likelihood, excluding the external variables, which are used only to select a relevant mixture model. Numerical experiments illustrate the promising behaviour of the derived criterion. © 2014 Springer-Verlag Berlin Heidelberg.
Abstract:
The growing demand for steels with tighter compositional specifications led the Companhia Siderúrgica Nacional (CSN) to develop more efficient processes. To address this, this paper aims to identify the operational variables with the greatest impact on the desulfurization process, specifically in the torpedo car, as well as their causes and solutions. It then selects and tests, in laboratory and industrial trials, desulfurizing agents based on CaC2, CaO, CaCO3, and Mg to assess the cost per quantity of product desulfurized. The mixture with the best results was not the one with the highest CaC2 content. It is believed that this mixture was more efficient because of the increased agitation of the bath produced by the release of gas from the CaCO3 present in the mixture. Copyright © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Abstract:
In a matched experimental design, the effectiveness of matching in reducing bias and increasing power depends on the strength of the association between the matching variable and the outcome of interest. In particular, in the design of a community health intervention trial, the effectiveness of a matched design, where communities are matched according to some community characteristic, depends on the strength of the correlation between the matching characteristic and the change in the health behavior being measured. We attempt to estimate the correlation between community characteristics and changes in health behaviors in four datasets from community intervention trials and observational studies. Community characteristics that are highly correlated with changes in health behaviors would potentially be effective matching variables in studies of health intervention programs designed to change those behaviors. Among the community characteristics considered, the urban-rural character of the community was the most highly correlated with changes in health behaviors. The correlations between Per Capita Income, Percent Low Income & Percent aged over 65 and changes in health behaviors were marginally statistically significant (p < 0.08).
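The quantity at the centre of this abstract is the correlation between a community characteristic and the change in a health behavior. A minimal sketch of that calculation follows, assuming the Pearson coefficient is the measure intended (the abstract does not name one). The function name and the data are hypothetical and invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical community-level data (illustrative only):
# percent of the community that is urban, and change in a behavior score.
percent_urban = [20.0, 35.0, 50.0, 65.0, 80.0]
change_in_behavior = [1.0, 2.5, 2.0, 4.0, 4.5]

r = pearson_r(percent_urban, change_in_behavior)
print(f"r = {r:.3f}")
```

A characteristic with a coefficient close to 1 or -1 would be a stronger candidate matching variable than one with a coefficient near 0.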
Abstract:
Body fat distribution is a cardiovascular health risk factor in adults. Body fat distribution can be measured through various methods including anthropometry. It is not clear which anthropometric index is suitable for epidemiologic studies of fat distribution and cardiovascular disease. The purpose of the present study was to select a measure of body fat distribution from among a series of indices (those traditionally used in the literature and others constructed from the analysis) that is most highly correlated with lipid-related variables and is independent of overall fatness. Subjects were Mexican-American men and women (N = 1004) from a study of gallbladder disease in Starr County, Texas. Multivariate associations were sought between lipid profile measures (lipids, lipoproteins, and apolipoproteins) and two sets of anthropometric variables (4 circumferences and 6 skinfolds). This was done to assess the association between lipid-related measures and the two sets of anthropometric variables and to guide the construction of indices. Two indices emerged from the analysis that seemed to be highly correlated with lipid profile measures independent of obesity. These indices are: 2*arm circumference-thigh skinfold in pre- and post-menopausal women and arm/thigh circumference ratio in men. Next, using the sum of all skinfolds to represent obesity and the selected body fat distribution indices, the following hypotheses were tested: (1) state of obesity and centrally/upper distributed body fat are equally predictive of lipids, lipoproteins and apolipoproteins, and (2) the correlation among the lipid-related measures is not altered by obesity and body fat distribution. With respect to the first hypothesis, the present study found that most lipids, lipoproteins and apolipoproteins were significantly associated with both overall fatness and anatomical location of body fat in both sex and menopausal groups. However, within men and post-menopausal women, certain lipid profile measures (triglyceride and HDLT among post-menopausal women and apos C-II, C-III, and E among men) had substantially higher correlation with body fat distribution than with overall fatness. With respect to the second hypothesis, both obesity and body fat distribution were found to alter the association among plasma lipid variables in men and women. There was a suggestion from the data that the patterns of correlations among men and post-menopausal women are more comparable. Among men, correlations involving apo A-I, HDLT, and HDL2 seemed greatly influenced by obesity, and A-II by fat distribution; among post-menopausal women, correlations involving apos A-I and A-II were highly affected by the location of body fat. Thus, these data point out that not only can obesity and fat distribution affect levels of single measures, they can also markedly influence the pattern of relationship among measures. The fact that such changes are seen for both obesity and fat distribution is significant, since the indices employed were chosen because they were independent of one another.
Abstract:
This paper studies feature subset selection in classification using a multiobjective estimation of distribution algorithm. We consider six functions, namely area under ROC curve, sensitivity, specificity, precision, F1 measure and Brier score, for evaluation of feature subsets and as the objectives of the problem. One of the characteristics of these objective functions is the existence of noise in their values that should be appropriately handled during optimization. Our proposed algorithm consists of two major techniques which are specially designed for the feature subset selection problem. The first one is a solution ranking method based on interval values to handle the noise in the objectives of this problem. The second one is a model estimation method for learning a joint probabilistic model of objectives and variables which is used to generate new solutions and advance through the search space. To simplify model estimation, l1 regularized regression is used to select a subset of problem variables before model learning. The proposed algorithm is compared with a well-known ranking method for interval-valued objectives and a standard multiobjective genetic algorithm. Particularly, the effects of the two new techniques are experimentally investigated. The experimental results show that the proposed algorithm is able to obtain comparable or better performance on the tested datasets.
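The six objective functions named in this abstract can be sketched in plain Python as below. This is an illustrative implementation of the standard definitions, not the authors' code; the function name, the 0.5 decision threshold and the sample data are assumptions.

```python
def classification_objectives(y_true, y_prob, threshold=0.5):
    """Compute six common classifier scores from binary labels and
    predicted probabilities of the positive class."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0  # assumed decision rule
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)

    # Brier score: mean squared difference between probability and label.
    brier = sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

    # AUC as the fraction of (positive, negative) pairs ranked correctly,
    # counting ties as half a correct pair.
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else 0.0

    return {"auc": auc, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "f1": f1, "brier": brier}

# Hypothetical labels and predicted probabilities (illustrative only).
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.1]
scores = classification_objectives(y_true, y_prob)
print(scores)
```

Five of these scores are maximised while the Brier score is minimised, so a multiobjective optimiser has to treat the directions consistently, for example by negating the Brier score.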
Abstract:
Errata sheet inserted.
Abstract:
Negative-ion mode electrospray ionization, ESI(-), with Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS) was coupled to partial least squares (PLS) regression and variable selection methods to estimate the total acid number (TAN) of Brazilian crude oil samples. Generally, ESI(-)-FT-ICR mass spectra present a resolving power of ca. 500,000 and a mass accuracy better than 1 ppm, producing a data matrix containing over 5700 variables per sample. These variables correspond to heteroatom-containing species detected as deprotonated molecules, [M - H]⁻ ions, which are identified primarily as naphthenic acids, phenols and carbazole analog species. The TAN values for all samples ranged from 0.06 to 3.61 mg of KOH g⁻¹. To facilitate the spectral interpretation, three methods of variable selection were studied: variable importance in the projection (VIP), interval partial least squares (iPLS) and elimination of uninformative variables (UVE). The UVE method seems to be more appropriate for selecting important variables, reducing the number of variables to 183 and producing a root mean square error of prediction of 0.32 mg of KOH g⁻¹. By reducing the size of the data, it was possible to relate the selected variables to their corresponding molecular formulas, thus identifying the main chemical species responsible for the TAN values.
Abstract:
Background: Feature selection is a pattern recognition approach for choosing important variables according to some criteria in order to distinguish or explain certain phenomena (i.e., for dimensionality reduction). Many genomic and proteomic applications rely on feature selection to answer questions such as selecting signature genes which are informative about some biological state, e.g., normal tissues and several types of cancer, or inferring a prediction network among elements such as genes, proteins and external stimuli. In these applications, a recurrent problem is the lack of samples to perform an adequate estimate of the joint probabilities between element states. A myriad of feature selection algorithms and criterion functions have been proposed, although it is difficult to identify the best solution for each application. Results: The intent of this work is to provide an open-source multiplatform graphical environment for bioinformatics problems, which supports many feature selection algorithms, criterion functions and graphic visualization tools such as scatterplots, parallel coordinates and graphs. A feature selection approach for growing genetic networks from seed genes (targets or predictors) is also implemented in the system. Conclusion: The proposed feature selection environment allows data analysis using several algorithms, criterion functions and graphic visualization tools. Our experiments have shown the software's effectiveness in two distinct types of biological problems. In addition, the environment can be used in different pattern recognition applications, although its main focus is bioinformatics tasks.