861 results for Linear-regression
Abstract:
Locally weighted regression is a technique that predicts the response for new data items from their neighbors in the training data set, where closer data items are assigned higher weights in the prediction. However, the original method may suffer from overfitting and fail to select the relevant variables. In this paper we propose combining a regularization approach with locally weighted regression to achieve sparse models. Specifically, we use the lasso, a shrinkage and selection method for linear regression. We present an algorithm that embeds the lasso in an iterative procedure that alternately computes weights and performs lasso-wise regression. The algorithm is tested on three synthetic scenarios and two real data sets. Results show that the proposed method outperforms linear and local models for several kinds of scenarios.
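The weighting idea described in this abstract can be sketched for a single predictor. This is a minimal illustration, not the authors' algorithm: the Gaussian kernel, the `bandwidth` parameter and the function name are assumptions, and the lasso step is omitted entirely.

```python
import math

def locally_weighted_predict(xs, ys, x0, bandwidth=1.0):
    """Predict y at x0 from a kernel-weighted straight-line fit (1-D sketch)."""
    # Assumed Gaussian kernel: training points closer to x0 get larger weights.
    w = [math.exp(-((x - x0) ** 2) / (2.0 * bandwidth ** 2)) for x in xs]
    sw = sum(w)
    # Weighted means of x and y.
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    # Weighted least-squares slope around the weighted means.
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

# Hypothetical data: a sine curve sampled on a grid.
xs = [i / 10.0 for i in range(63)]
ys = [math.sin(x) for x in xs]
pred = locally_weighted_predict(xs, ys, 1.5, bandwidth=0.3)
```

A smaller `bandwidth` makes the fit follow local structure more closely but raises the risk of overfitting, which is the weakness the abstract sets out to address with regularization.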
Abstract:
Specific cutting energy (SE) has been widely used to assess rock cuttability for mechanical excavation purposes. Some prediction models have been developed for SE by correlating rock properties with SE values. However, some textural and compositional rock parameters, i.e., texture coefficient and feldspar, mafic, and felsic mineral contents, were not considered. The present study investigates the effects of these previously ignored rock parameters, along with engineering rock properties, on SE. Mineralogical and petrographic analyses, rock mechanics tests, and linear rock cutting tests were performed on sandstone samples taken from sites around Ankara, Turkey. Relationships between SE and rock properties were evaluated using bivariate correlation and linear regression analyses. The tests and subsequent analyses revealed that the texture coefficient and feldspar content of sandstones affected rock cuttability, as evidenced by significant correlations between these parameters and SE at a 90% confidence level. Felsic and mafic mineral contents of sandstones did not exhibit any statistically significant correlation with SE. Cementation coefficient, effective porosity, and pore volume had good correlations with SE. Poisson's ratio, Brazilian tensile strength, Shore scleroscope hardness, Schmidt hammer hardness, dry density, and point load strength index showed very strong linear correlations with SE at confidence levels of 95% and above, and all of them were also found suitable for predicting SE individually, based on the results of regression analysis, ANOVA, Student's t-tests, and R² values. Poisson's ratio exhibited the highest correlation with SE and seemed to be the most reliable SE prediction tool in sandstones.
Abstract:
Background: The residue-wise contact order (RWCO) describes the sequence separations between a residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable information for reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. Results: We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences. 
Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance with the CC to 0.57 and an RMSE of 0.79. In addition, combining the predicted secondary structure by PSIPRED was found to significantly improve the prediction performance and could yield the best prediction accuracy with a CC of 0.60 and RMSE of 0.78, which provided at least comparable performance compared with the other existing methods. Conclusion: The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences.
Abstract:
Correlation and regression are two of the statistical procedures most widely used by optometrists. However, these tests are often misused or interpreted incorrectly, leading to erroneous conclusions from clinical experiments. This review examines the major statistical tests concerned with correlation and regression that are most likely to arise in clinical investigations in optometry. First, the use, interpretation and limitations of Pearson's product moment correlation coefficient are described. Second, the least squares method of fitting a linear regression to data and methods for testing how well a regression line fits the data are described. Third, the problems of using linear regression methods in observational studies, when there are errors in measuring the independent variable, and of predicting a new value of Y for a given X are discussed. Finally, methods for testing whether a non-linear relationship provides a better fit to the data and for comparing two or more regression lines are considered.
Abstract:
Fitting a linear regression to data provides much more information about the relationship between two variables than a simple correlation test. A goodness of fit test of the line should always be carried out. Hence, ‘r squared’ estimates the strength of the relationship between Y and X, ANOVA whether a statistically significant line is present, and the ‘t’ test whether the slope of the line is significantly different from zero. In addition, it is important to check whether the data fit the assumptions for regression analysis and, if not, whether a transformation of the Y and/or X variables is necessary.
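The three checks named in this abstract (r squared, the ANOVA F ratio and the t test on the slope) can be computed directly from the least-squares sums. The sketch below is illustrative only: the function name and the five data points are made up, and P values are left out because they require the t and F distributions.

```python
import math

def simple_linear_regression(xs, ys):
    """Least-squares line with r squared, slope t statistic and ANOVA F ratio."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)          # total sum of squares
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares around the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / syy
    ms_res = ss_res / (n - 2)                     # residual mean square, n - 2 df
    t_slope = slope / math.sqrt(ms_res / sxx)     # tests slope against zero
    f_ratio = (syy - ss_res) / ms_res             # regression MS (1 df) / residual MS
    return {"slope": slope, "intercept": intercept, "r_squared": r_squared,
            "t_slope": t_slope, "f_ratio": f_ratio}

# Hypothetical measurements.
fit = simple_linear_regression([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```

With a single X variable the F ratio equals the square of the slope's t statistic, so the ANOVA and the t test lead to the same conclusion about the line.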
Abstract:
1. Fitting a linear regression to data provides much more information about the relationship between two variables than a simple correlation test. A goodness of fit test of the line should always be carried out. Hence, ‘r squared’ estimates the strength of the relationship between Y and X, ANOVA whether a statistically significant line is present, and the ‘t’ test whether the slope of the line is significantly different from zero. 2. Always check whether the data collected fit the assumptions for regression analysis and, if not, whether a transformation of the Y and/or X variables is necessary. 3. If the regression line is to be used for prediction, it is important to determine whether the prediction involves an individual y value or a mean. Care should be taken if predictions are made close to the extremities of the data, and predictions are subject to considerable error if x falls beyond the range of the data. Multiple predictions require correction of the P values. 4. If several individual regression lines have been calculated from a number of similar sets of data, consider whether they should be combined to form a single regression line. 5. If the data exhibit a degree of curvature, then fitting a higher-order polynomial curve may provide a better fit than a straight line. In this case, a test of whether the data depart significantly from a linear regression should be carried out.
Abstract:
In previous statnotes, the application of correlation and regression methods to the analysis of two variables (X,Y) was described. These methods can be used to determine whether there is a linear relationship between the two variables, whether the relationship is positive or negative, to test the degree of significance of the linear relationship, and to obtain an equation relating Y to X. This Statnote extends the methods of linear correlation and regression to situations where there are two or more X variables, i.e., 'multiple linear regression'.
Abstract:
Purpose: To determine whether curve-fitting analysis of the ranked segment distributions of topographic optic nerve head (ONH) parameters, derived using the Heidelberg Retina Tomograph (HRT), provides a more effective statistical descriptor to differentiate the normal from the glaucomatous ONH. Methods: The sample comprised 22 normal control subjects (mean age 66.9 years; S.D. 7.8) and 22 glaucoma patients (mean age 72.1 years; S.D. 6.9) confirmed by reproducible visual field defects on the Humphrey Field Analyser. Three 10°-images of the ONH were obtained using the HRT. The mean topography image was determined and the HRT software was used to calculate the rim volume, rim area to disc area ratio, normalised rim area to disc area ratio and retinal nerve fibre cross-sectional area for each patient at 10°-sectoral intervals. The values were ranked in descending order, and each ranked-segment curve of ordered values was fitted using the least squares method. Results: There was no difference in disc area between the groups. The group mean cup-disc area ratio was significantly lower in the normal group (0.204 ± 0.16) compared with the glaucoma group (0.533 ± 0.083) (p < 0.001). The visual field indices, mean deviation and corrected pattern S.D., were significantly greater (p < 0.001) in the glaucoma group (-9.09 dB ± 3.3 and 7.91 ± 3.4, respectively) compared with the normal group (-0.15 dB ± 0.9 and 0.95 dB ± 0.8, respectively). Univariate linear regression provided the best overall fit to the ranked segment data. The equation parameters of the regression line, manually applied to the normalised rim area-disc area and the rim area-disc area ratio data, correctly classified 100% of normal subjects and glaucoma patients. In this study sample, the regression analysis of ranked segment parameters was more effective than conventional ranked segment analysis, in which glaucoma patients were misclassified in approximately 50% of cases. 
Further investigation in larger samples will enable the calculation of confidence intervals for normality. These reference standards will then need to be investigated for an independent sample to fully validate the technique. Conclusions: Using a curve-fitting approach to fit ranked segment curves retains information relating to the topographic nature of neural loss. Such methodology appears to overcome some of the deficiencies of conventional ranked segment analysis, and subject to validation in larger scale studies, may potentially be of clinical utility for detecting and monitoring glaucomatous damage. © 2007 The College of Optometrists.
Abstract:
2000 Mathematics Subject Classification: 62J12, 62K15, 91B42, 62H99.
Abstract:
The analysis of risk measures associated with price series data movements, and their prediction, is of strategic importance in the financial markets as well as to policy makers, in particular for short- and long-term planning for setting up economic growth targets. For example, oil-price risk management focuses primarily on when and how an organization can best prevent costly exposure to price risk. Value-at-Risk (VaR) is the commonly practised instrument to measure risk and is evaluated by analysing the negative/positive tail of the probability distributions of the returns (profit or loss). In modelling applications, least-squares estimation (LSE)-based linear regression models are often employed for modeling and analyzing correlated data. These linear models are optimal and perform relatively well under conditions such as errors following normal or approximately normal distributions, being free of large outliers and satisfying the Gauss-Markov assumptions. However, in practical situations the LSE-based linear regression models often fail to provide optimal results, for instance, in non-Gaussian situations, especially when the errors follow distributions with fat tails and error terms possess a finite variance. This is the situation in the case of risk analysis, which involves analyzing tail distributions. Thus, applications of the LSE-based regression models may be questioned for appropriateness and may have limited applicability. We have carried out the risk analysis of Iranian crude oil price data based on the Lp-norm regression models and have noted that the LSE-based models do not always perform the best. We discuss results from the L1, L2 and L∞-norm based linear regression models. ACM Computing Classification System (1998): B.1.2, F.1.3, F.2.3, G.3, J.2.
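The difference between the L1, L2 and L∞ criteria is easiest to see in the simplest possible model, a single constant fitted to the data. The sketch below makes that simplifying assumption (it is not the paper's regression model), and the return values are hypothetical.

```python
import statistics

def l1_fit(values):
    """Constant minimising the sum of absolute errors: the median."""
    return statistics.median(values)

def l2_fit(values):
    """Constant minimising the sum of squared errors: the mean."""
    return statistics.mean(values)

def linf_fit(values):
    """Constant minimising the largest absolute error: the midrange."""
    return (min(values) + max(values)) / 2.0

# Hypothetical returns with one extreme value in the upper tail.
returns = [-0.5, -0.2, 0.0, 0.1, 0.3, 9.0]
fits = {"L1": l1_fit(returns), "L2": l2_fit(returns), "Linf": linf_fit(returns)}
```

The L1 fit barely moves when the extreme value is added, the L2 fit is pulled toward it, and the L∞ fit is determined entirely by the two extremes, which is why the choice of norm matters when the tails are the object of study.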
Abstract:
2010 Mathematics Subject Classification: 68T50,62H30,62J05.
Abstract:
This paper explains how Poisson regression can be used in studies in which the dependent variable describes the number of occurrences of some rare event such as suicide. After pointing out why ordinary linear regression is inappropriate for treating dependent variables of this sort, we go on to present the basic Poisson regression model and show how it fits in the broad class of generalized linear models. Then we turn to discussing a major problem of Poisson regression known as overdispersion and suggest possible solutions, including the correction of standard errors and negative binomial regression. The paper ends with a detailed empirical example, drawn from our own research on suicide.
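A basic Poisson regression with a log link and one covariate can be fitted by Newton-Raphson on the log-likelihood, and overdispersion can be screened with the Pearson statistic divided by its degrees of freedom. The sketch below assumes that setup; the function names and the counts are hypothetical and are not taken from the paper's suicide data.

```python
import math

def fit_poisson(xs, ys, iterations=50):
    """Fit log(mu) = b0 + b1 * x by Newton-Raphson (single-covariate sketch)."""
    b0 = math.log(sum(ys) / len(ys))   # start at the log of the mean count
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * x) for x in xs]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(y - m for y, m in zip(ys, mu))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mu))
        # Fisher information matrix entries.
        h00 = sum(mu)
        h01 = sum(m * x for x, m in zip(xs, mu))
        h11 = sum(m * x * x for x, m in zip(xs, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: inverse information times score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return b0, b1

def pearson_dispersion(xs, ys, b0, b1):
    """Pearson chi-square divided by residual degrees of freedom (n - 2)."""
    mu = [math.exp(b0 + b1 * x) for x in xs]
    chi2 = sum((y - m) ** 2 / m for y, m in zip(ys, mu))
    return chi2 / (len(xs) - 2)

# Hypothetical event counts at six covariate values.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, 3, 4, 8, 12]
b0, b1 = fit_poisson(xs, ys)
dispersion = pearson_dispersion(xs, ys, b0, b1)
```

A dispersion value well above 1 would suggest overdispersion, in which case the remedies the abstract mentions (corrected standard errors or negative binomial regression) become relevant.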
Abstract:
Annual average daily traffic (AADT) is important information for many transportation planning, design, operation, and maintenance activities, as well as for the allocation of highway funds. Many studies have attempted AADT estimation using the factor approach, regression analysis, time series, and artificial neural networks. However, these methods are unable to account for the spatially variable influence of independent variables on the dependent variable, even though it is well known that in many transportation problems, including AADT estimation, spatial context is important. In this study, applications of geographically weighted regression (GWR) methods to estimating AADT were investigated. The GWR-based methods considered the influence of correlations among the variables over space and the spatial non-stationarity of the variables. A GWR model allows different relationships between the dependent and independent variables to exist at different points in space. In other words, model parameters vary from location to location, and the locally linear regression parameters at a point are affected more by observations near that point than by observations further away. The study area was Broward County, Florida. Broward County lies on the Atlantic coast between Palm Beach and Miami-Dade counties. In this study, a total of 67 variables were considered as potential AADT predictors, and six variables (lanes, speed, regional accessibility, direct access, density of roadway length, and density of seasonal household) were selected to develop the models. To investigate the predictive powers of various AADT predictors over space, statistics including local r-square, local parameter estimates, and local errors were examined and mapped. The local variations in relationships among parameters were investigated, measured, and mapped to assess the usefulness of GWR methods. 
The results indicated that the GWR models were able to better explain the variation in the data and to predict AADT with smaller errors than the ordinary linear regression models for the same dataset. Additionally, GWR was able to model the spatial non-stationarity in the data, i.e., the spatially varying relationship between AADT and predictors, which cannot be modeled in ordinary linear regression.
Abstract:
Quantile regression (QR) was first introduced by Roger Koenker and Gilbert Bassett in 1978. It is robust to outliers, which strongly affect the least squares estimator in linear regression. Instead of modeling the mean of the response, QR provides an alternative way to model the relationship between quantiles of the response and covariates. Therefore, QR can be widely used to solve problems in econometrics, environmental sciences and health sciences. Sample size is an important factor in the planning stage of experimental design and observational studies. In ordinary linear regression, sample size may be determined based on either precision analysis or power analysis with closed form formulas. There are also methods that calculate sample size based on precision analysis for QR, such as that of C. Jennen-Steinmetz and S. Wellek (2005). A method to estimate sample size for QR based on power analysis was proposed by Shao and Wang (2009). In this paper, a new method is proposed to calculate sample size based on power analysis under a hypothesis test of covariate effects. Even though an error distribution assumption is not necessary for QR analysis itself, researchers have to make assumptions about the error distribution and covariate structure in the planning stage of a study to obtain a reasonable estimate of sample size. In this project, both parametric and nonparametric methods are provided to estimate the error distribution. Since the proposed method can be implemented in R, the user is able to choose either a parametric distribution or nonparametric kernel density estimation for the error distribution. The user also needs to specify the covariate structure and effect size to carry out the sample size and power calculation. The performance of the proposed method is further evaluated using numerical simulation. The results suggest that the sample sizes obtained from our method provide empirical powers that are close to the nominal power level, for example, 80%.
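The quantile-modeling idea in this abstract rests on the Koenker-Bassett check loss. The sketch below illustrates it for an intercept-only model, which is an assumed simplification; it does not implement the sample size method, and the function names and data are hypothetical.

```python
def check_loss(residual, tau):
    """Check function: rho_tau(u) = u * (tau - 1) if u < 0, else u * tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def best_constant(values, tau):
    """Intercept-only quantile fit: the data value with the smallest total check loss."""
    # A minimiser of the total check loss is always attained at a data point,
    # so searching over the observed values is sufficient.
    return min(sorted(values),
               key=lambda c: sum(check_loss(y - c, tau) for y in values))

# Hypothetical responses with one large outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
median_fit = best_constant(data, 0.5)
upper_fit = best_constant(data, 0.9)
mean_fit = sum(data) / len(data)
```

Comparing `median_fit` with `mean_fit` shows the robustness to outliers that the abstract describes: the outlier drags the mean far from the bulk of the data, while the tau = 0.5 fit stays with the central values.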